mapreduce in java - gzip input files

I'm running a Hadoop job on a bunch of gzipped input files, which Hadoop should normally handle transparently. Unfortunately, in my case the input files don't have a .gz extension. I'm using CombineTextInputFormat, which runs the job fine when I point it at uncompressed files, but produces garbage when I point it at the gzipped ones.
I've searched for quite some time, but the only thing I've turned up is somebody else asking the same question, with no answer: How to force Hadoop to unzip inputs regardless of their extension?
Anybody got anything?
Went digging in the source and built a solution for this...
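First, it's worth confirming the files really are gzip despite the missing extension. Plain JDK code can check the gzip magic bytes and round-trip the data; this checker is my own illustration (class and method names are made up for the example), not part of Hadoop:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCheck {
    // Gzip streams always start with the magic bytes 0x1f 0x8b,
    // no matter what the file is named.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Compress a byte array with gzip (round-trip demo helper).
    static byte[] gzip(byte[] plain) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(plain);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress: GZIPInputStream only looks at the stream contents,
    // never at a .gz extension.
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("line one\nline two".getBytes(StandardCharsets.UTF_8));
        System.out.println(looksGzipped(compressed)); // prints "true"
        System.out.println(gunzip(compressed));
    }
}
```

If the first two bytes of your inputs are 0x1f 0x8b, they're gzip, and the problem is purely that Hadoop's extension-based codec lookup never kicks in.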
You need to modify the source of the LineRecordReader class to change how it chooses a compression codec. The default version creates a Hadoop CompressionCodecFactory and calls getCodec, which parses the file path for its extension. You can instead call getCodecByClassName to obtain any codec you want.
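A sketch of that change (against the Hadoop 2.x API; the class name and unconditional gzip assumption are mine, since this forces every input through GzipCodec rather than detecting anything):

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Sketch of the codec-selection change inside a copied LineRecordReader.
// Stock code does: codec = new CompressionCodecFactory(conf).getCodec(file);
// which returns null when the path has no recognized extension.
public class ForcedGzipCodecLookup {
    public static CompressionCodec gzipCodec(Configuration conf) {
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Look the codec up by class name instead of by file extension.
        return factory.getCodecByClassName(
            "org.apache.hadoop.io.compress.GzipCodec");
    }

    public static InputStream open(Path file, Configuration conf)
            throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream raw = fs.open(file);
        // Wrap the raw stream in the gzip decompressor unconditionally,
        // instead of only when getCodec() matched an extension.
        return gzipCodec(conf).createInputStream(raw);
    }
}
```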
You'll then need to override your input format class to make it use your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
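The override itself can be as small as this sketch. It extends plain TextInputFormat for brevity (the same idea applies to CombineTextInputFormat's reader); GzipForcingLineRecordReader stands in for your edited copy of LineRecordReader, a name assumed here, not something Hadoop provides:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class GzipForcingTextInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Hand back the modified reader instead of the stock one.
        return new GzipForcingLineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Gzip streams can't be split, and with no extension Hadoop
        // can't tell, so disable splitting outright.
        return false;
    }
}
```

Then point the job at it with job.setInputFormatClass(GzipForcingTextInputFormat.class).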