hadoop amazon-web-services amazon-s3 emr amazon-emr

Combining AWS EMR output

I ran a test AWS EMR job with a custom mapper but with NONE as the reducer. I got the (expected) output in 13 separate "part" files. How can I combine them into a single file?

I don't need to aggregate data in any special way, and I don't care if it is sorted, re-ordered arbitrarily, or left in order. But I would like to efficiently put the data back into a single file. Do I have to do that manually, or is there a way to do it as part of the EMR Cluster?

It's very strange to me that there isn't a default option or some sort of automatic step available for this. I've read a bit about the Identity Reducer. Does it do what I want, and if so, how do I use it when launching a cluster through the EMR console?

My data is in S3.

EDIT

To be very clear, I can run cat on all of the output parts after the job is done, if that's what I have to do. Locally, or on an EC2 instance, or whatever. Is that really what everyone does?

Solution

If the mapper's output part files are small, you can use hadoop fs -getmerge to merge them into a single file on the local filesystem:

hadoop fs -getmerge s3n://BUCKET/path/to/output/ [LOCAL_FILE]

Then put the merged file back into S3:

hadoop fs -put [LOCAL_FILE] s3n://BUCKET/path/to/put/
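
If you want to run the two steps together, something like the following might work. This is only a rough sketch: the function name merge_parts, the temporary directory, and the local file name merged-output are made up for illustration, and it assumes the hadoop command is on the PATH and that the merged data fits on the local disk.

```shell
#!/usr/bin/env bash
# Sketch: merge the part files under an output path into one file and upload it.
# merge_parts and merged-output are hypothetical names, not part of Hadoop or EMR.
merge_parts() {
  local src_dir="$1"    # e.g. s3n://BUCKET/path/to/output/
  local dest_dir="$2"   # e.g. s3n://BUCKET/path/to/put/
  local tmp_dir
  tmp_dir="$(mktemp -d)" || return 1
  local merged="$tmp_dir/merged-output"

  # Concatenate every file under src_dir into one local file,
  # then upload that file only if the merge succeeded.
  hadoop fs -getmerge "$src_dir" "$merged" &&
    hadoop fs -put "$merged" "$dest_dir"
  local rc=$?

  # Remove the local copy whether or not the upload succeeded.
  rm -rf "$tmp_dir"
  return "$rc"
}

# Usage (same placeholders as the commands above):
# merge_parts s3n://BUCKET/path/to/output/ s3n://BUCKET/path/to/put/
```

The function returns a non-zero status if either step fails, so it can be chained with other commands in a script.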

For the above commands to work, the following properties must be set in core-site.xml:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>