
File compression formats and container file formats


It is generally said that a non-splittable compression format like Gzip, when used together with a container file format such as Avro or SequenceFile, effectively becomes splittable.

Does this mean that the blocks inside the container format get compressed with the chosen codec (like Gzip), or is something else going on? Can someone please explain this? Thanks!

Well, I think the question requires an update.

Update:

Is there a straightforward way to convert a large file compressed with a non-splittable format (like Gzip) into a splittable file (using a container file format such as Avro, SequenceFile or Parquet) so that it can be processed by MapReduce?

Note: I am not asking for workarounds such as uncompressing the file and then recompressing the data with a splittable compression format.


Solution

  • For SequenceFiles, if you specify BLOCK compression, each block is compressed with the specified codec. Blocks let Hadoop split the data at block boundaries even when the codec itself (such as Gzip) is not splittable, and skip whole blocks without needing to decompress them.

    Most of this is described on the Hadoop wiki: https://wiki.apache.org/hadoop/SequenceFile

    Block compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
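    As an illustration, here is a minimal sketch of writing such a file directly through the SequenceFile.Writer API; the class name, output path and toy records are made up for the example, while the Hadoop calls themselves are the standard ones:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.util.ReflectionUtils;

        public class BlockCompressedSeqFileDemo {
            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Path out = new Path("demo.seq"); // hypothetical output path
                GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

                // BLOCK compression: records are buffered and whole blocks are
                // gzip-compressed, so the file stays splittable at block boundaries.
                SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                        SequenceFile.Writer.file(out),
                        SequenceFile.Writer.keyClass(LongWritable.class),
                        SequenceFile.Writer.valueClass(Text.class),
                        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, codec));
                try {
                    for (long i = 0; i < 1000; i++) {
                        writer.append(new LongWritable(i), new Text("record-" + i));
                    }
                } finally {
                    writer.close();
                }
            }
        }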

    For Avro this is all very similar: https://avro.apache.org/docs/1.7.7/spec.html#Object+Container+Files

    Objects are stored in blocks that may be compressed. Synchronization markers are used between blocks to permit efficient splitting of files for MapReduce processing.

    Thus, each block's binary data can be efficiently extracted or skipped without deserializing the contents.
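    The Avro side looks much the same if you write the object container file directly with DataFileWriter; a small sketch, where the schema, file name and record contents are placeholders:

        import java.io.File;

        import org.apache.avro.Schema;
        import org.apache.avro.file.CodecFactory;
        import org.apache.avro.file.DataFileWriter;
        import org.apache.avro.generic.GenericData;
        import org.apache.avro.generic.GenericDatumWriter;
        import org.apache.avro.generic.GenericRecord;

        public class AvroContainerDemo {
            public static void main(String[] args) throws Exception {
                Schema schema = new Schema.Parser().parse(
                        "{\"type\":\"record\",\"name\":\"Line\","
                        + "\"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");

                DataFileWriter<GenericRecord> writer =
                        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
                // Each block of serialized objects is deflate-compressed; the sync
                // markers between blocks keep the file splittable for MapReduce.
                writer.setCodec(CodecFactory.deflateCodec(6));
                writer.create(schema, new File("lines.avro")); // hypothetical path

                GenericRecord rec = new GenericData.Record(schema);
                for (int i = 0; i < 1000; i++) {
                    rec.put("text", "record-" + i);
                    writer.append(rec);
                }
                writer.close();
            }
        }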

    The easiest (and usually fastest) way to convert data from one format into another is to let MapReduce do the work for you. In the example of:

    GZip Text -> SequenceFile

    You would have a map-only job that uses TextInputFormat for input and writes its output with SequenceFileOutputFormat. That gives you a 1-to-1 conversion on the number of files (add a reduce step if this needs changing), and the conversion runs in parallel if there are lots of files to convert.
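
    A rough sketch of such a map-only driver, assuming gzipped text as input and a block-compressed SequenceFile as output (the class names GzipTextToSequenceFile and PassThroughMapper are illustrative, not from a particular library):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.compress.GzipCodec;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

        public class GzipTextToSequenceFile {

            // Identity mapper: passes the byte offset and line straight through.
            public static class PassThroughMapper
                    extends Mapper<LongWritable, Text, LongWritable, Text> {
                @Override
                protected void map(LongWritable key, Text value, Context context)
                        throws java.io.IOException, InterruptedException {
                    context.write(key, value);
                }
            }

            public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = Job.getInstance(conf, "gzip text to sequence file");
                job.setJarByClass(GzipTextToSequenceFile.class);

                job.setMapperClass(PassThroughMapper.class);
                job.setNumReduceTasks(0); // map-only: one output file per input file

                // TextInputFormat decompresses .gz inputs transparently.
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(SequenceFileOutputFormat.class);
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);

                // Block-compress the SequenceFile output with gzip so it stays splittable.
                FileOutputFormat.setCompressOutput(job, true);
                FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
                SequenceFileOutputFormat.setOutputCompressionType(job,
                        SequenceFile.CompressionType.BLOCK);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }

    Point it at a directory of .gz text files as the first argument and an output directory as the second; because the input codec is handled by the input format, the job itself needs no gzip-specific logic.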