Search code examples
gziphuffman-codedeflatelz77

Can DEFLATE only compress duplicate strings up to 32 KiB apart?


According to DEFLATE spec:

  1. Compressed representation overview

A compressed data set consists of a series of blocks, corresponding to successive blocks of input data. The block sizes are arbitrary, except that non-compressible blocks are limited to 65,535 bytes.

Each block is compressed using a combination of the LZ77 algorithm and Huffman coding. The Huffman trees for each block are independent of those for previous or subsequent blocks; the LZ77 algorithm may use a reference to a duplicated string occurring in a previous block, up to 32K input bytes before.

Each block consists of two parts: a pair of Huffman code trees that describe the representation of the compressed data part, and a compressed data part. (The Huffman trees themselves are compressed using Huffman encoding.) The compressed data consists of a series of elements of two types: literal bytes (of strings that have not been detected as duplicated within the previous 32K input bytes), and pointers to duplicated strings, where a pointer is represented as a pair <length, backward distance>. The representation used in the "deflate" format limits distances to 32K bytes and lengths to 258 bytes, but does not limit the size of a block, except for uncompressible blocks, which are limited as noted above.

So pointers to duplicate strings only go back 32 KiB, but since block size is not limited, could the Huffman code tree store two duplicate strings more than 32 KiB apart as the same code? Then is the limiting factor the block size?


Solution

  • The Huffman tree for distances contains codes 0 to 29 (table below); the code 29, followed by 8191 in "plain" bits, means "distance 32768". That's a hard limit in the definition of Deflate. The block size is not limiting. Actually the block size is not stored anywhere: the block is an infinite stream. If you want to stop the block, you send an End-Of-Block code for that.

                                 Distance Codes
                                 --------------
           Extra           Extra             Extra               Extra
      Code Bits Dist  Code Bits  Dist   Code Bits Distance  Code Bits Distance
      ---- ---- ----  ---- ---- ------  ---- ---- --------  ---- ---- --------
        0   0    1      8   3   17-24    16    7  257-384    24   11  4097-6144
        1   0    2      9   3   25-32    17    7  385-512    25   11  6145-8192
        2   0    3     10   4   33-48    18    8  513-768    26   12  8193-12288
        3   0    4     11   4   49-64    19    8  769-1024   27   12 12289-16384
        4   1   5,6    12   5   65-96    20    9 1025-1536   28   13 16385-24576
        5   1   7,8    13   5   97-128   21    9 1537-2048   29   13 24577-32768
        6   2   9-12   14   6  129-192   22   10 2049-3072
        7   2  13-16   15   6  193-256   23   10 3073-4096