Tags: hadoop, hdfs, hadoop-streaming, compression, snappy

Read Snappy Compressed data on HDFS from Hadoop Streaming


I have a folder in my HDFS system that contains text files compressed with the Snappy codec.

Normally, when reading GZIP-compressed files in a Hadoop Streaming job, decompression happens automatically. With Snappy-compressed data, however, it does not, and I am unable to process the data.
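For reference, this is the shape of the job I am running; the streaming jar location, paths, and script name below are illustrative:

    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -input /user/me/snappy-input \
        -output /user/me/output \
        -mapper my_script.py \
        -file my_script.py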

How can I read these files and process them in Hadoop Streaming?

Many thanks in advance.

UPDATE:

If I use the command hadoop fs -text file it works, so the data itself is readable. The problem only occurs in Hadoop Streaming: the data is not decompressed before being passed to my Python script.
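To illustrate the check that does work (the path is illustrative):

    # -text detects the codec from the file and decompresses it
    hadoop fs -text /user/me/snappy-input/part-00000.snappy | head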


Solution

  • I think I have an answer to the problem. It would be great if someone could confirm it.

    Browsing the Cloudera blog, I found this article explaining the Snappy codec. As it states:

    One thing to note is that Snappy is intended to be used with a container format, like Sequence Files or Avro Data Files, rather than being used directly on plain text, for example, since the latter is not splittable and can’t be processed in parallel using MapReduce.

    Therefore, a plain-text file compressed with the Snappy codec in HDFS can be read with hadoop fs -text, but not in a Hadoop Streaming job (MapReduce); Snappy is meant to be used inside a container format such as SequenceFiles or Avro data files rather than directly on plain text.
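    If the data has to stay Snappy-compressed, one workaround, sketched below and untested here, is to store it in SequenceFiles and let Streaming decode the container: SequenceFileAsTextInputFormat hands the mapper plain text. The jar location, paths, and script name are illustrative:

        # Snappy must be available on the cluster; if the codec is not
        # already registered cluster-wide in io.compression.codecs
        # (core-site.xml), it can be added per job via -D as below
        # (in production, append it to the existing codec list).
        hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
            -D io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec \
            -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
            -input /user/me/snappy-seqfiles \
            -output /user/me/output \
            -mapper my_script.py \
            -file my_script.py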