Tags: hadoop, zip, gzip, hadoop-2, hadoop-partitioning

How to deal with .gz input files with Hadoop?


Please allow me to provide a scenario:

hadoop jar test.jar Test inputFileFolder outputFileFolder

where

  • test.jar sorts info by key, time, and place
  • inputFileFolder contains multiple .gz files, each .gz file is about 10GB
  • outputFileFolder contains a bunch of .gz files

My question is: what is the best way to deal with those .gz files in the inputFileFolder? Thank you!


Solution

  • Hadoop will automatically detect and decompress .gz input files. However, because gzip is not a splittable compression format, each file is processed by a single mapper, so a folder of 10 GB files gives you very little parallelism. Your best options are to switch to a splittable format (for example bzip2, or Snappy inside a block-compressed container such as a SequenceFile; raw Snappy files are not splittable either), or to decompress, split, and re-compress the input into smaller, roughly block-sized .gz files so that each one gets its own mapper.
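The decompress/split/re-compress step can be done outside Hadoop with standard tools. The sketch below is one way to do it, assuming local (non-HDFS) folders; the function name and the lines-per-chunk parameter are illustrative, and in practice you would tune the chunk size so each compressed part lands near your HDFS block size (e.g. 128 MB):

```shell
# split_gz_dir IN_DIR OUT_DIR [LINES_PER_CHUNK]
# Stream-decompress each .gz file, split the plain text into
# fixed-size chunks, and re-compress each chunk, so Hadoop can
# later assign one mapper per chunk.
split_gz_dir() {
  indir="$1"; outdir="$2"; lines="${3:-1000000}"  # chunk size is an assumption; tune per block size
  mkdir -p "$outdir"
  for f in "$indir"/*.gz; do
    base=$(basename "$f" .gz)
    # gunzip -c streams to stdout, so the full 10 GB file is
    # never held uncompressed on disk
    gunzip -c "$f" | split -l "$lines" - "$outdir/${base}_part_"
  done
  # Re-compress every chunk; each resulting .gz is small enough
  # to be read by a single mapper without dominating the job
  gzip "$outdir"/*_part_*
}
```

For example, `split_gz_dir inputFileFolder splitFolder 5000000` would turn each input file into a series of 5-million-line .gz parts that you could then upload with `hdfs dfs -put` and pass to the job as the input folder.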