Search code examples
hadoopcompressiongzipcodecbzip2

Compression/Decompression, what make a codec splittable?


I just cannot get a clear picture of so called 'splittable' codec due to lack of compression knowledge. For instanse, gzip vs. bzip2, I do see that when running a MapReduce against a ~400M gzip file, it only spins up one mapper, while for bzip2, it spins up 4 mappers.

What is the fundamental problem that makes gzip not splittable? I've heard that gzip is not splittable because it stores its 'metadata' in the file header while bzip2 stores 'metadata' for each block.

If this is the only problem prevents gzip from being splittable, why not simply let all gzip decompressor threads first go to the file header to fetch the 'metadata', then each thread takes care of part of the compressed file?

Another question, for bzip2, it breaks the file into multiple blocks. Is it possible that one record (one line if we take text file as example) been splitted into 2 parts and placed in 2 different blocks? Does bzip2 care about the record completeness when creating block?

Many thanks!


Solution

  • A bzip2 stream consists of a series of independently decompressible blocks, where each block starts with a fixed 48-bit sequence: 0x314159265359. (The number may look familiar.) That permits simply searching for that sequence and starting decompression there.

    There is a very small probability of a false positive, since nothing prevents that sequence from appearing in the compressed data in a block.

    A gzip stream normally has no breakpoints at which decompression can start, nor any markers of such points. It is one long, dependent stream.

    However a gzip stream can be constructed with breakpoints and markers if desired. pigz has a -i or --independent option which makes blocks that are independently decompressible, at some small cost in compression ratio. The uncompressed block size is selected using -b or --blocksize, with a 128K default. Between each block are the bytes 00 00 ff ff 00 00 00 ff ff, which can be searched for. Since it is 72 bits, there is a much lower probability of a false positive than searching for a 48-bit identifier.