Search code examples
deflatecompression

DEFLATE (RFC1951) dynamic huffman "incomplete length"


I've been studying RFC1951 and 'puff.c', and have a question about the issue of "incomplete length".

As near I can tell, defining a "dynamic" Huffman code table that allows for more codes than specified by HLIT+257 will produce an error, at least by puff.c. For example, an error is produced by 'puff.c' if, as a simple debugging test, I were to use a Huffman table of all 9-bit codes to define only 257 lit/lens. Is this outcome purposeful or a bug? And can I assume that any "inflator" based on the 'zlib' library will produce the same error?

I can't find any specification in RFC 1951 that should REQUIRE the use of a sufficiently tight Huffman code. Certainly, I can see that using an "under-subscribed" Huffman table might be inefficient, in terms of compression, but I'm not sure why such a table should be prohibited from use.

My interest isn't simply hypothetical. I really want to use an under-subscribed, literal-only, Huffman code (but NOT the example cited above) to compress some application specific images into PNG files. But I want to make sure it will work with any PNG image viewer.


Solution

  • The RFC specifies that the codes are Huffman codes, which by definition are complete codes. (Complete means that all bit patterns are used.)

    zlib will reject incomplete or oversubscribed codes, except in the special case noted in the RFC:

    If only one distance code is used, it is encoded using one bit, not zero bits; in this case there is a single code length of one, with one unused code.

    There the incomplete code 0 for the single symbol, with code 1 unused, is permitted.

    (That, by the way, is unnecessary. If there is only one distance symbol, then you don't need any bits to specify it. You know that that distance symbol must be used with any length. If that symbol needs extra bits, then those extra bits immediately follow the length. But, oh well -- for that case Phil Katz put an extraneous zero bit in every match, and now we're stuck with it.)

    The fact that the RFC even had to note this special case is another clue that incomplete codes are not accepted otherwise.

    There is sort of another exception in deflate, in that the fixed literal/length code is incomplete, with two unused codes at the end.

    The bottom line is, no, you will not be able to use an incomplete code in a dynamic header (except the special case) and expect zlib or any compliant deflate decoder to be able to decode it.

    As for why this strictness is useful, constraints on dynamic headers permit rapid detection of non-deflate streams or corrupted deflate streams. Similarly, a dynamic header with no end code is not permitted by zlib, so as to avoid the case of a bogus dynamic header permitting any following random bits to be decodable forever, never detecting an error. The unused fixed codes also help in this regard, since eventually they trigger an error in random input.

    By the way, if you want to define a fixed, complete Huffman code for your case, it's super simple, and would reduce the size of almost all of your codes by one bit. Just encode eight bits for the symbols 0..253, using that symbol number directly as the code (reversing the bits of course), and nine bits for symbols 254..257, using the codes 508..511 (bits reversed).