
Combining AWS EMR output


I ran a test AWS EMR job with a custom mapper but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I combine them into a single file?

I don't need to aggregate data in any special way, and I don't care if it is sorted, re-ordered arbitrarily, or left in order. But I would like to efficiently put the data back into a single file. Do I have to do that manually, or is there a way to do it as part of the EMR Cluster?

It's very strange to me that there isn't a default option or some sort of automatic step available for this. I've read a bit about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching a cluster through the EMR console?

My data is in S3.


EDIT

To be very clear, I can run cat on all of the output parts after the job is done, if that's what I have to do. Locally, or on an EC2 instance, or whatever. Is that really what everyone does?


Solution

  • If the mapper's output part files are small, you can use hadoop fs -getmerge to merge them into a single file on the local filesystem:

    hadoop fs -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE]
    

    Then put the merged file back on S3:

    hadoop fs -put [LOCAL_FILE] s3n://BUCKET/path/to/put/
    

    For the above commands to work, the following properties must be set in core-site.xml:

    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>
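
If you would rather do the merge by hand with cat, as the question's edit suggests, here is a minimal sketch. It assumes the part files have already been copied into one local directory and follow the usual part-* naming. The function name merge_parts and both of its arguments are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# merge_parts DIR OUT (hypothetical helper)
# Concatenate every DIR/part-* file into OUT.
# The shell expands the glob in sorted name order, so zero-padded names
# such as part-00000 ... part-00012 are appended in sequence.
merge_parts() {
    dir=$1
    out=$2

    # Reuse the function's positional parameters to hold the matched files.
    set -- "$dir"/part-*

    # If nothing matched, the pattern is left unexpanded and does not exist.
    if [ ! -e "$1" ]; then
        echo "no part files found in $dir" >&2
        return 1
    fi

    cat "$@" > "$out"
}
```

For example, merge_parts ./output merged.txt would write the combined data to merged.txt, which can then be uploaded with hadoop fs -put as above. Keep the output file outside the part-* naming pattern so it is never picked up as an input.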