Search code examples
compressionhuffman-codedeflate

How are Huffman trees transmitted?


I'm trying to understand how the DEFLATE algorithm works. I found this document published by UC Davis. I don't understand the part where it talks about how Huffman trees are transmitted

Probably the trickiest part of the DEFLATE specification to understand is the way trees are encoded to go along with the data, when that data is compressed with specialized trees.

The trees are transmitted by their codelengths, as previously discussed. The codelengths are put all together into a sequence of numbers between 0 and 15 (the Huffman trees that are created must be kept to codelengths of no more than 15; this is the tricky part, not the part about constraining the order of the elements).

Not all the elements have to be given codelengths; if the last elements of an alphabet are of 0 codelengths, they can and probably should be left out. The number of elements in each of the two alphabets will be transmitted, so the trimmed alphabets go together into a single sequence.

First of all, what does codelength mean exactly and why can it be 0?

Also I didn't understand run-length compression quite well, they mention it right after the last paragraph.

Once this sequence of codelengths is assembled, it is compressed with a form of what is called run-length compression. When several elements in a row have the same codelength (often 0) special symbols may be used to indicate the number of elements with this codelength. Our sequence is now a sequence of numbers between 0 and 18 (possibly with extra bits forming integers to modify base values, as was the case with the length and distance codes).

A Huffman tree is created for this alphabet of 0-18. Sigh. The sequence of 0-18 codes and extra bits is prepared with the Huffman codes replacing the 0-18 elements.


Solution

  • A codelength is the length of the code in bits for that symbol.

    A zero codelength means that that symbol does not appear in the compressed data, so there is no code for that symbol.

    Run-length encoding means, in this case, that a sequence of repeated codelengths, e.g. "7, 7, 7, 7, 7, 7", is replaced by "7, repeat the last length 5 times".