I'm trying to use s3distcp to copy a lot of small gzipped files which unfortunately don't end in a gz extension. s3distcp has an outputCodec argument that can be used to compress the output, but it doesn't have a corresponding inputCodec. I'm trying to use --jobconf with the hadoop streaming call but it doesn't seem to be doing anything (the output is still gzipped). The command I'm using is:
hadoop jar lib/emr-s3distcp-1.0.jar -Dstream.recordreader.compression=gzip \
--src s3://inputfolder --dest hdfs:///data
Any ideas what might be going on? I'm running AWS EMR AMI-3.9.
As you can see in the s3distcp code (https://github.com/netshade/s3distcp/blob/b899910d04a112019ba695f29d3b0b3d9a785603/src/main/java/com/amazon/external/elasticmapreduce/s3distcp/CopyFilesReducer.java, line 197), s3distcp relies on the file extension to decide how to instantiate the InputStream. It is therefore not possible to specify the input format as a parameter.
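To illustrate why a gzipped file without a gz extension comes through still compressed, here is a minimal Python sketch of extension-based stream selection. It is not s3distcp's actual code, only an imitation of the behavior described above. The function names `open_by_extension` and `open_by_magic` and the file name `part-0001` are hypothetical. The second function shows one alternative, checking the gzip magic number (the bytes 0x1f 0x8b) instead of trusting the file name, which you could use in your own pre-processing step.

```python
import gzip
import io

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_by_extension(name, raw):
    """Hypothetical stand-in for extension-based selection:
    decompress only when the file name ends in .gz."""
    if name.endswith(".gz"):
        return gzip.GzipFile(fileobj=io.BytesIO(raw))
    return io.BytesIO(raw)


def open_by_magic(raw):
    """Hypothetical alternative: sniff the gzip magic number
    instead of relying on the file name."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=io.BytesIO(raw))
    return io.BytesIO(raw)


payload = gzip.compress(b"hello\n")

# No .gz extension: the bytes are passed through still compressed.
still_compressed = open_by_extension("part-0001", payload).read()

# With the extension, or with magic-number sniffing, the data is decompressed.
by_name = open_by_extension("part-0001.gz", payload).read()
by_magic = open_by_magic(payload).read()
```

With extension-based selection, the practical options are to rename the source objects so they end in .gz before copying, or to decompress them in a separate step after the copy.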