Tags: hadoop, zip, gzip, hadoop-2, hadoop-partitioning

How to deal with .gz input files with Hadoop?


Please allow me to provide a scenario:

hadoop jar test.jar Test inputFileFolder outputFileFolder

where

  • test.jar sorts info by key, time, and place
  • inputFileFolder contains multiple .gz files, each .gz file is about 10GB
  • outputFileFolder contains a bunch of .gz files

My question is: what is the best way to deal with those .gz files in the inputFileFolder? Thank you!


Solution

  • Hadoop will automatically detect and decompress .gz input files. However, because gzip is not a splittable compression format, each file is processed by a single mapper, so a folder of 10 GB files gives you very little parallelism. Your best options are to switch to a splittable format (for example bzip2, or Snappy inside a block-compressed container such as a SequenceFile; raw Snappy files are not splittable either), or to decompress, split, and re-compress the input into smaller, roughly block-sized .gz files so that each one gets its own mapper.
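The decompress/split/re-compress step can be done outside Hadoop with standard tools. The sketch below is one way to do it, assuming local (non-HDFS) folders; the function name and the lines-per-chunk parameter are illustrative, and in practice you would tune the chunk size so each compressed part lands near your HDFS block size (e.g. 128 MB):

```shell
# split_gz_dir IN_DIR OUT_DIR [LINES_PER_CHUNK]
# Stream-decompress each .gz file, split the plain text into
# fixed-size chunks, and re-compress each chunk, so Hadoop can
# later assign one mapper per chunk.
split_gz_dir() {
  indir="$1"; outdir="$2"; lines="${3:-1000000}"  # chunk size is an assumption; tune per block size
  mkdir -p "$outdir"
  for f in "$indir"/*.gz; do
    base=$(basename "$f" .gz)
    # gunzip -c streams to stdout, so the full 10 GB file is
    # never held uncompressed on disk
    gunzip -c "$f" | split -l "$lines" - "$outdir/${base}_part_"
  done
  # Re-compress every chunk; each resulting .gz is small enough
  # to be read by a single mapper without dominating the job
  gzip "$outdir"/*_part_*
}
```

For example, `split_gz_dir inputFileFolder splitFolder 5000000` would turn each input file into a series of 5-million-line .gz parts that you could then upload with `hdfs dfs -put` and pass to the job as the input folder.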