An improvement to using this technique is to have a standard text stream with an escape character that means a run length sequence follows this one. Then you can get your program to check to see if it is worth putting in the run length encode or not. Another improvement that works with bits is you don't store a type, you just assume that between each length the type of the bit changes. I can see that this is as clear as mud.
Example:
Using a byte sized length field and a byte sized
type field. The following file is compressed
using run length encoding.
AAAAABBBBBBCCCCC
Encodes to:
<5>A<6>B<5>C
The numbers inside <>'s are counts, and are one byte long. This compresses a 16 byte file to a 6 byte file. This example is biased a little towards the algorithm we have chosen.
A simple method of handling this is to only allow pointers backwards from the current location and to have a special escape sequence to mean that the next character is a pointer. We store the pointer as a relative distance from the current location and a length of the same string follows it. The one place where this does not do as well is in cases where large numbers of the same byte as stored in a row. The size of the compressed file is logarithmic. We first store the one byte, we store a pointer back to it of length 1, we store a pointer back to the first byte of length 2, then a pointer back to the first byte of length 4, etc.
Example:
This is a nice example of some text to
compress. This is a nice example.
Ok. We start by placing things directly into
the text stream:
This is a nice example of some text to
compress. <Escape char> <23><pointer to the
first char>
The pointer back saves us 23 bytes, we use 3 bytes to store it, so we save 20 bytes. This method gets very good compression on your average text file. One method of implementing such an agorithm is to use a hash table on the last few items in the compression stream. You look up in this hash table each time you get another value from the compresion stream. If you have a match, you check to see how long the match you have is, if not you insert the new value into the table. Even if you do get a match, you still want to insert the new pointer into the hash table so that your backwards pointers are not too long when you look them up.
This is a quick explanation of a couple of the most common compresion methods. If you want to find more information, there are whole books on the subject. Many thanks to Peter Lewis who sent a copy of a simple LZRW algorithm to the UCC mailing list.