Search code examples
binary-treehuffman-codedeflate

Size constraints on DEFLATE Huffman trees


Reading through RFC 1951, it seems like a valid dynamic Huffman tree is not necessarily a full binary tree - for example, the tree specified by the bit lengths (2, 2, 2) has a node with a single child:

    ____
   /    \
  0      1
 / \    /
0   1  0

This causes an issue when attempting to allocate enough memory for the full tree in one allocation, as there is not necessarily any upper bound on the number of nodes in an arbitrary binary tree.

Do the constraints in the DEFLATE standard imply an upper bound on the size of a dynamic Huffman tree?


Solution

  • RFC 1951 defines "Huffman code" to be what is produced in the construction of an optimal prefix code. Therefore the Huffman codes used in a valid deflate stream must be complete, in order to be optimal. Your example is not, with the unused code 11. A 1-2-2 code would be complete, with no unused codes.

    There is one exception to this rule in RFC 1951, which is that a solitary distance code is coded as one-bit, not zero bits:

    If only one distance code is used, it is encoded using one bit, not zero bits; in this case there is a single code length of one, with one unused code.

    zlib's inflate rejects any deflate stream with an invalid Huffman code.

    There is one more subtlety, which is that the Huffman codes in deflate are length-limited to 15 bits, enforced in the encoding of the codes. This does not change the fact that the codes need to be complete.

    For a complete prefix code, the number of internal nodes is n-1, where n is the number of symbols coded.

    Due to the length-limit subtlety, there is a limit on the number of nodes even for non-optimal prefix codes. The worst case would be to assign the maximum number of bits to all of the symbols. Then just construct the tree and count the nodes. What you end up with is a flat code tree for the symbols, with the base of that connected to the root with a line of single branch nodes to the root. So actually, not many more nodes than for an optimal prefix code, due to the way a canonical code is constructed.

    As an example, if all 30 distance codes have length 15, the canonical tree looks like this:

    tree with long trunk

    Instead of 29 internal nodes for an optimal prefix code, this one has 40 internal nodes.