I had a few questions regarding the header of a DEFLATE block, specifically concerning this section:
5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
5 Bits: HDIST, # of Distance codes - 1 (1 - 32)
4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
(HCLEN + 4) x 3 bits: code lengths for the code length
alphabet given just above, in the order: 16, 17, 18,
0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
These code lengths are interpreted as 3-bit integers
(0-7); as above, a code length of 0 means the
corresponding symbol (literal/length or distance code
length) is not used.
HLIT + 257 code lengths for the literal/length alphabet,
encoded using the code length Huffman code
HDIST + 1 code lengths for the distance alphabet,
encoded using the code length Huffman code
Do HLIT, HDIST, and HCLEN have to be at least 257, 1, and 4 respectively? For example, if my uncompressed data consists of only 26 distinct bytes, there would be only 26 literal/length codes (provided the LZ77 phase inserts no length codes/back-references). However, 26 - 257 is negative, so how would you store that in 5 bits?
In the (HCLEN + 4) x 3 bits, HLIT + 257 code lengths, and HDIST + 1 code lengths sections, if one of the codes is unused, should anything be emitted to the DEFLATE block? For instance, if 14 in the code length codes is unused, should three zero bits be emitted, a single zero bit, or nothing?
In the HLIT + 257 code lengths and HDIST + 1 code lengths sections, how many bits should each code length be?
Thanks for your help!
Yes. Each of those is the number of codes represented in the list of lengths, and the header field stores that number minus the offset (257, 1, or 4), so the stored value is never negative. If a length is zero, then that symbol does not have a code. So in the example you give, there would be 257 lengths in the header for the literal/length codes (HLIT = 0), but only 27 of them would be non-zero. (The 27th is for the end-of-block symbol.)
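To make the arithmetic concrete, here is a minimal Python sketch of the 26-letter example. The helper `transmitted_count` and the specific code lengths (4 bits for 'a' to 'e', 5 bits for the rest) are assumptions chosen for illustration, not something a real compressor is required to produce.

```python
def transmitted_count(lengths, minimum):
    """Count of code lengths to send: drop trailing zeros, never below `minimum`."""
    count = len(lengths)
    while count > minimum and lengths[count - 1] == 0:
        count -= 1
    return count

# Hypothetical block: only the bytes 'a'..'z' occur, plus end-of-block (symbol 256).
lit_len_lengths = [0] * 286
for byte in range(ord("a"), ord("z") + 1):
    # Illustrative lengths that happen to form a complete prefix code.
    lit_len_lengths[byte] = 4 if byte < ord("f") else 5
lit_len_lengths[256] = 5

num_lit_len = transmitted_count(lit_len_lengths, 257)
hlit = num_lit_len - 257          # value stored in the 5-bit HLIT field
used = sum(1 for n in lit_len_lengths[:num_lit_len] if n != 0)
print(num_lit_len, hlit, used)    # 257 0 27
```

The 230 symbols that never occur still take up a slot in the list of 257 lengths; they simply carry a length of zero.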
If a code is unused, its entry in the header is still emitted: three zero bits for a code length code symbol, or the Huffman code for a length of zero for a literal/length or distance symbol.
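As a rough sketch of the (HCLEN + 4) x 3 bits part, the following Python lists the 3-bit fields for a made-up code length code. The function name `code_length_header` and the example lengths are assumptions. Note that only entries after the last non-zero one in the permuted order are dropped, which is what the HCLEN count expresses; an unused symbol before that point still costs three zero bits.

```python
# Order in which the code length code lengths are transmitted (from the RFC text above).
CL_ORDER = [16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15]

def code_length_header(cl_lengths):
    """Return (HCLEN, list of 3-bit field values) for lengths indexed by symbol 0..18."""
    permuted = [cl_lengths[symbol] for symbol in CL_ORDER]
    count = len(permuted)
    # Trailing zeros in the permuted order need not be sent, but at least 4 entries are.
    while count > 4 and permuted[count - 1] == 0:
        count -= 1
    return count - 4, permuted[:count]

# Hypothetical code length code: only symbols 0, 4, 5 and 18 are used, 2 bits each.
cl_lengths = [0] * 19
for symbol in (0, 4, 5, 18):
    cl_lengths[symbol] = 2

hclen, fields = code_length_header(cl_lengths)
print(hclen)             # 8
print(fields)            # [0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 2]
print(len(fields) * 3)   # 36 bits in total, 3 per field
```

Here symbols 16 and 17 are unused but come first in the order, so each is written as the 3-bit value 0. Symbol 14 is also unused, but it falls after the last non-zero entry, so in this particular example nothing is written for it.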
Each of those code lengths is sent as a Huffman code of variable bit length, as described by the code length code.
It says right there in the RFC:
A code length of 0 indicates that the corresponding symbol in
the literal/length or distance alphabet will not occur in the
block, and should not participate in the Huffman code
construction algorithm given earlier.
Also it says right there in what you copied in your own question:
HLIT + 257 code lengths for the literal/length alphabet,
encoded using the code length Huffman code
HDIST + 1 code lengths for the distance alphabet,
encoded using the code length Huffman code
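To illustrate "encoded using the code length Huffman code", here is a small Python sketch that builds canonical codes from a set of code length code lengths, following the construction the RFC describes for Huffman codes, and then encodes a few literal/length code lengths with it. The names `canonical_codes` and `cl_lengths` and the example values are assumptions, and the repeat symbols 16 to 18 are left out to keep it short.

```python
def canonical_codes(lengths):
    """Map each symbol with a non-zero length to its canonical code, as a bit string."""
    max_len = max(lengths)
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1          # zero lengths do not participate
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for symbol, n in enumerate(lengths):
        if n:
            codes[symbol] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes

# Hypothetical code length code: symbol 0 gets 1 bit, symbols 4 and 5 get 2 bits.
cl_lengths = [0] * 19
cl_lengths[0], cl_lengths[4], cl_lengths[5] = 1, 2, 2

codes = canonical_codes(cl_lengths)
print(codes)                                        # {0: '0', 4: '10', 5: '11'}
print("".join(codes[n] for n in [4, 0, 0, 5]))      # 100011
```

So in this made-up case a literal/length code length of 0 costs one bit in the header, while a length of 4 or 5 costs two bits. The number of bits per code length depends entirely on the code length code chosen for that block.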