mapreduce in java - gzip input files

I'm running a Hadoop job on a bunch of gzipped input files, which Hadoop should normally handle transparently. Unfortunately, in my case the input files don't have a .gz extension. I'm using CombineTextInputFormat, which runs the job fine when I point it at uncompressed files, but produces garbage when I point it at the gzipped ones.
I've searched for quite some time, but the only thing I've turned up is somebody else asking the same question, with no answer: How to force Hadoop to unzip inputs regardless of their extension?
Anybody got anything?
Went digging in the source and built a solution for this...
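First, it's worth confirming the files really are gzip despite the missing extension. Plain JDK code can check the gzip magic bytes and round-trip the data; this checker is my own illustration (class and method names are made up for the example), not part of Hadoop:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCheck {
    // Gzip streams always start with the magic bytes 0x1f 0x8b,
    // no matter what the file is named.
    static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Compress a byte array with gzip (round-trip demo helper).
    static byte[] gzip(byte[] plain) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(plain);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decompress: GZIPInputStream only looks at the stream contents,
    // never at a .gz extension.
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("line one\nline two".getBytes(StandardCharsets.UTF_8));
        System.out.println(looksGzipped(compressed)); // prints "true"
        System.out.println(gunzip(compressed));
    }
}
```

If the first two bytes of your inputs are 0x1f 0x8b, they're gzip, and the problem is purely that Hadoop's extension-based codec lookup never kicks in.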
You need to modify the source of the LineRecordReader class to change how it chooses a compression codec. The default version creates a Hadoop CompressionCodecFactory and calls getCodec, which parses the file path for its extension. You can instead call getCodecByClassName to obtain any codec you want.
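A sketch of that change (against the Hadoop 2.x API; the class name and unconditional gzip assumption are mine, since this forces every input through GzipCodec rather than detecting anything):

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Sketch of the codec-selection change inside a copied LineRecordReader.
// Stock code does: codec = new CompressionCodecFactory(conf).getCodec(file);
// which returns null when the path has no recognized extension.
public class ForcedGzipCodecLookup {
    public static CompressionCodec gzipCodec(Configuration conf) {
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        // Look the codec up by class name instead of by file extension.
        return factory.getCodecByClassName(
            "org.apache.hadoop.io.compress.GzipCodec");
    }

    public static InputStream open(Path file, Configuration conf)
            throws IOException {
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream raw = fs.open(file);
        // Wrap the raw stream in the gzip decompressor unconditionally,
        // instead of only when getCodec() matched an extension.
        return gzipCodec(conf).createInputStream(raw);
    }
}
```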
You'll then need to override your input format class to make it use your new record reader. Details here: http://daynebatten.com/2015/11/override-hadoop-compression-codec-file-extension/
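The override itself can be as small as this sketch. It extends plain TextInputFormat for brevity (the same idea applies to CombineTextInputFormat's reader); GzipForcingLineRecordReader stands in for your edited copy of LineRecordReader, a name assumed here, not something Hadoop provides:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class GzipForcingTextInputFormat extends TextInputFormat {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Hand back the modified reader instead of the stock one.
        return new GzipForcingLineRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Gzip streams can't be split, and with no extension Hadoop
        // can't tell, so disable splitting outright.
        return false;
    }
}
```

Then point the job at it with job.setInputFormatClass(GzipForcingTextInputFormat.class).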