Search code examples
compressiongzipzlibdeflatelz77

How does DEFLATE optimize this so much?


I am trying to understand the deflate algorithm, and I have read up on Huffman codes as well as LZ77 compression. I was toying around with compression sizes of different strings, and I stumbled across something I could not explain. The string aaa when compressed, both through zlib and gzip, turns out to be the same size as aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa (36 as).

Before reading about this I would have assumed the compressor does something like storing 36*a instead of each character individually, but I could not find anywhere in the specifications where that is mentioned.

Using fixed Huffman code yielded the same result, so I assume the space-saving lies in LZ77, but that only uses distance-length pairs. How would that allow a 3 length string to expand by 12 times without increasing in size?

Interrupting the string of as with one or several bs in the middle drastically increases the size. If distance-length pairs is what's doing the job, why could it not just skip over the bs when searching backwards? Or is Huffman codes being utilized and I misunderstood what fixed Huffman codes implies?


Solution

  • The 36 "a"s are effectively run-length encoded by LZ77 by giving the first "a" as a literal, and then a match with a distance of one, and a length of 35. The length can be as much as 258 for deflate.

    Look online for tutorials on LZ77, Huffman coding, and deflate. You can disassemble the resulting compressed data with infgen to get more insight into how the data is being represented.