Tags: hadoop, compression, hdfs, gzip, hadoop-streaming

How to compress Hadoop directory to single gzip file?


I have a directory containing many files and subdirectories that I want to compress and export from HDFS to the local filesystem.

I came across this question - Hadoop: compress file in HDFS? - but it seems to be relevant only to single files, and using hadoop-streaming with the GzipCodec gave me no success with directories.

What is the most efficient way to compress an HDFS folder into a single gzip file?
Thanks in advance.


Solution

  • For a quick and dirty solution, for those of you who don't want to use hadoop-streaming or any MapReduce job for it, I mounted HDFS with FUSE and then performed the actions on it as on a traditional filesystem (a minimal shell sketch follows the links below).
    Note that you probably don't want to use this as a permanent solution, only for a quick win :)
    Further reading:
    Further reading:
    * https://hadoop.apache.org/docs/r1.2.1/streaming.html
    * http://www.javased.com/index.php?api=org.apache.hadoop.io.compress.GzipCodec
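
    A minimal sketch of that FUSE route, assuming the hadoop-fuse-dfs wrapper that ships with some distributions (e.g. CDH) is installed; the mount point, NameNode address, and paths below are placeholders to adjust for your cluster:

        # Mount HDFS as a local filesystem via FUSE
        mkdir -p /mnt/hdfs
        hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs

        # The HDFS directory now behaves like an ordinary local directory,
        # so plain tar + gzip produces a single compressed file
        tar -czf /local/path/mydir.tar.gz -C /mnt/hdfs path/to/mydir

        # Unmount when done
        umount /mnt/hdfs

    Note that gzip on its own only compresses a single stream, so wrapping the directory in a tar archive first is what yields one .tar.gz file.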