What kinds of patterns in the data do compression algorithms look for?
For example, do they look for repetition, like "ABBBBC" = "AB4C"? Or do they look for more complicated patterns? I know text can be compressed using trees, for instance, but that may not work for all data, since some data may have equal probabilities for each character or symbol. To be clear, I'm wondering about universal compression algorithms like .zip
zip's default compression method, deflate, looks for redundancy in two ways. First, it looks for repeated strings of bytes, from three to 258 bytes in length, as far back as 32768 bytes. Such repeated strings are encoded as a length (3..258) and a distance (1..32768). A length greater than the distance results in more than one copy: e.g., a length of 100 and a distance of 1 repeats the last byte 100 times. So `hello, hello, wheeeeeeeee!` becomes `hello, (7,7)whe(8,1)!`, where `(7,7)` copies the seven bytes `hello, ` from seven bytes back, and `(8,1)` repeats the preceding `e` eight times.
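To make that matching step concrete, here is a minimal sketch of an LZ77-style tokenizer and decoder in Python. The function names and the `(length, distance)` tuple output are just for illustration; real deflate uses much faster hash-chain searching and emits a bit stream, not tuples.

```python
def lz77_tokens(data, window=32768, min_len=3, max_len=258):
    """Greedy LZ77 tokenizer: emits literal bytes and (length, distance)
    pairs, using deflate's limits for window size and match length."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        # Naive O(n^2) search over the window; deflate uses hash chains.
        for j in range(max(0, i - window), i):
            length = 0
            # Comparing against data[j + length] naturally allows
            # length > distance, i.e. overlapping copies like (8,1).
            while (length < max_len and i + length < len(data)
                   and data[i + length] == data[j + length]):
                length += 1
            if length >= min_len and length >= best_len:
                best_len, best_dist = length, i - j  # >= prefers closer matches
        if best_len:
            out.append((best_len, best_dist))
            i += best_len
        else:
            out.append(data[i])  # no match long enough: emit a literal
            i += 1
    return out

def lz77_decode(tokens):
    out = bytearray()
    for t in tokens:
        if isinstance(t, tuple):
            length, dist = t
            for _ in range(length):      # byte-by-byte copy is what makes
                out.append(out[-dist])   # a (100,1) repeat the last byte
        else:
            out.append(t)
    return bytes(out)
```

Running it on the example reproduces the tokens above:

```python
tokens = lz77_tokens(b"hello, hello, wheeeeeeeee!")
print([(t if isinstance(t, tuple) else chr(t)) for t in tokens])
# ['h', 'e', 'l', 'l', 'o', ',', ' ', (7, 7), 'w', 'h', 'e', (8, 1), '!']
assert lz77_decode(tokens) == b"hello, hello, wheeeeeeeee!"
```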
In deflate, the literal bytes (e.g. `h`), the lengths (e.g. 7), and the end-of-stream symbol are combined into a single alphabet of literal/length symbols. If a symbol is a length, it is followed by a distance symbol.
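A sketch of that combined alphabet, with one big simplification: real deflate maps the lengths 3..258 onto 29 length codes plus extra bits (and distances onto 30 distance codes), whereas here each length simply gets its own symbol.

```python
END_OF_BLOCK = 256  # deflate's end-of-stream symbol

def to_symbols(tokens):
    """Flatten LZ77 tokens into two symbol streams: a combined
    literal/length alphabet and a separate distance alphabet.
    Simplified: one symbol per length instead of deflate's 29
    length codes with extra bits."""
    litlen, dists = [], []
    for t in tokens:
        if isinstance(t, tuple):
            length, dist = t
            litlen.append(257 + (length - 3))  # length symbols start at 257
            dists.append(dist)                 # follows its length symbol
        else:
            litlen.append(t)                   # literal bytes are 0..255
    litlen.append(END_OF_BLOCK)
    return litlen, dists
```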
The second thing that deflate looks for is statistical. The literal/length symbols are counted, and the more frequent ones are coded in fewer bits, the less frequent ones in more bits. The same thing is done for the distance codes. This process is done optimally using Huffman coding. For the distribution of letters in English, an `e` might be coded in three bits, and a `q` in ten bits.
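Here is a compact sketch of Huffman coding itself, using Python's standard `heapq`. Note that deflate actually uses canonical, length-limited Huffman codes, which this toy version does not enforce.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code from symbol frequencies: frequent symbols
    get short codes, rare symbols long ones."""
    freq = Counter(symbols)
    if len(freq) == 1:                 # degenerate case: one symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tiebreak, [(symbol, code-so-far), ...]).
    heap = [(n, k, [(s, "")]) for k, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)    # merge the two least
        n2, _, c2 = heapq.heappop(heap)    # frequent subtrees
        merged = ([(s, "0" + c) for s, c in c1] +
                  [(s, "1" + c) for s, c in c2])
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return dict(heap[0][2])
```

On English-like text, the effect is exactly the one described above:

```python
code = huffman_code("the quick brown fox jumps over the lazy dog, he said eagerly")
assert len(code["e"]) <= len(code["q"])  # 'e' never gets a longer code than 'q'
```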
deflate has a few more tricks up its sleeve, but repeated-string matching followed by Huffman coding is the heart of the method.
All compression methods can be considered to consist of a modeling step followed by an entropy-coding step. The modeling step uses what is known about the expected redundancies in the data to transform it into a different form, one that extracts the redundancies actually found and expresses them compactly. For lossy compression, that step also excises information deemed unimportant to the reconstruction of the data. The entropy-coding step takes the result of the modeling step and codes it into a sequence of bits, ideally representing the statistics accurately enough that each bit is almost equally likely to be a 0 or a 1.
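Putting the two steps together using the toy pieces sketched above (modeling = LZ77 tokenization, entropy coding = Huffman), the whole pipeline is roughly:

```python
def toy_compress(data):
    """Toy two-stage compressor built from the sketches above:
    LZ77 modeling, then Huffman entropy coding. Illustrative
    only; this is not the real deflate format."""
    tokens = lz77_tokens(data)                    # modeling step
    litlen, dists = to_symbols(tokens)
    litlen_code = huffman_code(litlen)            # entropy-coding step
    dist_code = huffman_code(dists) if dists else {}
    bits = []
    for t in tokens:                              # emit in stream order:
        if isinstance(t, tuple):                  # each distance directly
            length, dist = t                      # follows its length symbol
            bits.append(litlen_code[257 + (length - 3)])
            bits.append(dist_code[dist])
        else:
            bits.append(litlen_code[t])
    bits.append(litlen_code[END_OF_BLOCK])        # end-of-block marker
    return "".join(bits)
```

On the example string, the output bit string comes out well under the 8 bits per byte of the raw input (ignoring the cost of transmitting the codes themselves, which a real deflate stream also has to pay).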