Search code examples
phpdeflate

walking through deflate by hand


I'm trying to understand how deflate works. To wit I thought I'd try to decode a deflate encoded string by hand going off of what RFC1951: DEFLATE Compressed Data Format Specification version 1.3 says. I generated the deflate encoded string thusly:

$result = gzdeflate('A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED');

echo bin2hex($result);

That gave me this:

1589c11100000cc166a3cc61ff2dca237709880c45e52c2b08eb043dedb78db8851e

From RFC1951 § 3.2.3. Details of block format:

         Each block of compressed data begins with 3 header bits
         containing the following data:

            first bit       BFINAL
            next 2 bits     BTYPE

...

         BFINAL is set if and only if this is the last block of the data
         set.

         BTYPE specifies how the data are compressed, as follows:

            00 - no compression
            01 - compressed with fixed Huffman codes
            10 - compressed with dynamic Huffman codes
            11 - reserved (error)

0x15 is 0b00010101 in bits. The first 3 bits are 000, which would make BFINAL 0 and BTYPE 00 (no compression).

From later in RFC1951 § 3.2.3. Details of block format:

           if stored with no compression
              skip any remaining bits in current partially
                 processed byte
              read LEN and NLEN (see next section)
              copy LEN bytes of data to output

From RFC1951 § 3.2.4. Non-compressed blocks (BTYPE=00):

     Any bits of input up to the next byte boundary are ignored.
     The rest of the block consists of the following information:

          0   1   2   3   4...
        +---+---+---+---+================================+
        |  LEN  | NLEN  |... LEN bytes of literal data...|
        +---+---+---+---+================================+

     LEN is the number of data bytes in the block.  NLEN is the
     one's complement of LEN.

So per all that I move to the next byte - 0x89, which is 0b10001001 in bits. The first two bits (LEN) are 10 and the next two bits are NLEN (00). But herein lies the problem. 00 isn't the one's complement of 10 - 01 is.

Also, what if you wanted to have an uncompressed block with 0xFF as the first byte? Is that just not possible?


Solution

  • You didn't read this part of the RFC:

    Data elements are packed into bytes in order of increasing bit number within the byte, i.e., starting with the least-significant bit of the byte.

    The first three bits are 101, not 000.

    A decoding of that stream using infgen:

    ! infgen 2.6 output
    !
    last
    dynamic
    count 259 10 16
    code 17 4
    code 18 3
    code 0 4
    code 4 3
    code 3 1
    code 2 3
    zeros 65
    lens 3 3 4 3 3
    zeros 25
    lens 3
    zeros 138
    zeros 22
    lens 4 3 3
    zeros 3
    lens 2 0 0 2 2 3 3
    ! litlen 65 3
    ! litlen 66 3
    ! litlen 67 4
    ! litlen 68 3
    ! litlen 69 3
    ! litlen 95 3
    ! litlen 256 4
    ! litlen 257 3
    ! litlen 258 3
    ! dist 3 2
    ! dist 6 2
    ! dist 7 2
    ! dist 8 3
    ! dist 9 3
    literal 'A_DEAD_D
    match 3 4
    literal 'CEDED_A_B
    match 3 12
    literal 'BABE
    match 4 11
    match 3 28
    match 4 20
    literal 'BAC
    match 4 13
    literal 'D
    end