Search code examples
deflate

decoding deflate blocks after (HCLEN + 4) x 3 bits


So in the deflate algorithm each block starts off with a 3 bit header:

     Each block of compressed data begins with 3 header bits
     containing the following data:

        first bit       BFINAL
        next 2 bits     BTYPE

Assuming BTYPE is 10 (compressed with dynamic Huffman codes) then the next 14 bits are as follows:

           5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
           5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
           4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)

The next (HCLEN + 4) x 4 bits represent the code lengths.

What happens after that is less clear to me.

RFC1951 § 3.2.7. Compression with dynamic Huffman codes (BTYPE=10) says this:

           HLIT + 257 code lengths for the literal/length alphabet,
              encoded using the code length Huffman code

           HDIST + 1 code lengths for the distance alphabet,
              encoded using the code length Huffman code

Doing infgen -ddis on 1589c11100000cc166a3cc61ff2dca237709880c45e52c2b08eb043dedb78db8851e (produced by doing gzdeflate('A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED')) gives this:

zeros 65                ! 0110110 110
lens 3                  ! 0
lens 3                  ! 0
lens 4                  ! 101
lens 3                  ! 0
lens 3                  ! 0
zeros 25                ! 0001110 110
lens 3                  ! 0
zeros 138               ! 1111111 110
zeros 22                ! 0001011 110
lens 4                  ! 101
lens 3                  ! 0
lens 3                  ! 0
zeros 3                 ! 000 1111
lens 2                  ! 100
lens 0                  ! 1110
lens 0                  ! 1110
lens 2                  ! 100
lens 2                  ! 100
lens 3                  ! 0
lens 3                  ! 0

I note that 65 is the hex encoding of "A" in ASCII, which presumably explains "zeros 65".

"lens" occurs 16 times, which is equal to HCLEN + 4.

In RFC1951 § 3.2.2. Use of Huffman coding in the "deflate" format there's this:

     2)  Find the numerical value of the smallest code for each
         code length:

            code = 0;
            bl_count[0] = 0;
            for (bits = 1; bits <= MAX_BITS; bits++) {
                code = (code + bl_count[bits-1]) << 1;
                next_code[bits] = code;
            }

So maybe that's what "zeros 65" is but then what about "zeros 25", "zeros 138" and "zeros 22"? 25, 138 and 22, in ASCII, do not appear in the compressed text.

Any ideas?


Solution

  • The next (HCLEN + 4) x 3 bits represent the code lengths.

    The number of lens's has nothing to do with HCLEN. The sequence of zeros and lens represent the 269 (259+10) literal/length and distance codes code lengths. If you add up the zeros and the number of lens, you get 269.

    A zero-length symbol means it does not appear in the compressed data. There are no literal bytes in the data in the range 0..64, so it starts with 65 zeros. The first symbol coded is then an 'A', with length 3.