hadoop amazon-web-services hadoop-streaming distcp

Can you use s3distcp with gzipped input?


I'm trying to use s3distcp to copy a lot of small gzipped files that unfortunately don't end in a .gz extension. s3distcp has an outputCodec argument that can be used to compress the output, but it has no corresponding inputCodec. I'm trying to use --jobconf with the hadoop streaming call, but it doesn't seem to do anything (the output is still gzipped). The command I'm using is:

hadoop jar lib/emr-s3distcp-1.0.jar -Dstream.recordreader.compression=gzip \
           --src s3://inputfolder --dest hdfs:///data

Any ideas what might be going on? I'm running AWS EMR AMI-3.9.


Solution

  • As you can see in the s3distcp code (https://github.com/netshade/s3distcp/blob/b899910d04a112019ba695f29d3b0b3d9a785603/src/main/java/com/amazon/external/elasticmapreduce/s3distcp/CopyFilesReducer.java, line 197), s3distcp relies on the file extension to decide how to instantiate the InputStream. It is therefore not possible to set the input format through a parameter.
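
To illustrate why the output stays gzipped, here is a minimal Python sketch of extension-based stream selection. It is not the s3distcp source: the function name `open_input`, the file names, and the single `.gz` check are assumptions made for illustration only. It just models a reader that picks a decoder from the file name and never inspects the content.

```python
import gzip
import io


def open_input(name: str, raw: bytes) -> io.BufferedIOBase:
    """Hypothetical helper: choose a decoder from the file name alone."""
    if name.lower().endswith(".gz"):
        # Recognised extension: wrap the bytes in a gzip decoder.
        return gzip.GzipFile(fileobj=io.BytesIO(raw))
    # Unrecognised extension: the bytes pass through unchanged,
    # even if they happen to be gzip data.
    return io.BytesIO(raw)


payload = b"hello\n"
compressed = gzip.compress(payload)

# Same gzip bytes, two different (made-up) file names.
with_ext = open_input("part-0001.gz", compressed).read()
without_ext = open_input("part-0001", compressed).read()

print(with_ext == payload)        # decoded because of the .gz suffix
print(without_ext == compressed)  # copied as-is, still gzip data
```

Under this model, a gzipped file without the expected suffix is treated as opaque bytes, which would match the behaviour described in the question: the copy succeeds, but the result is still compressed.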