I am not getting any output in S3 when I run a job in Amazon EMR.
I specified the arguments:
-inputfile s3n://exdsyslab/data/file.txt -outputdir s3n://exdsyslab/output
When I check the job logs, the job shows as completed successfully, but there is no output in the output folder of my bucket exdsyslab.
I also tried one more thing: I chained two jobs, specifying these arguments when creating the job flow:
-inputfile s3n://exdsyslab/data/file.txt -outputdir s3n://exdsyslab/result -outputdir1 s3n://exdsyslab/result1
The second job's input is the output of the first job.
While the second job was running, it failed with the following exception:
The output folder, "result", already exists.
This happened because the directory was created by the first job in the chain. How do I specify the input and output for the second job in the MapReduce chain?
And why is there no output in the S3 locations specified in the arguments?
For correct output, use these arguments:
-inputfile s3n://exdsyslab/data/file.txt -output s3n://exdsyslab/output
Note that the output directory is specified by "-output", not "-outputdir".
For chaining jobs: you can't do it the way you specified. You must add multiple steps to the job flow, one per job, so each job runs with its own arguments. This other answer may help you: https://stackoverflow.com/a/11109592/1203129
For your specific case, the input/output directories have to look like this:
Step 1:
-inputfile s3n://exdsyslab/data/file.txt -output s3n://exdsyslab/result
Step 2:
-input s3n://exdsyslab/result -output s3n://exdsyslab/result1
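The two steps above can be sketched as a boto3 step list. This is a minimal illustration, not a tested EMR setup: the JAR path, step names, and failure action are assumptions (the original post doesn't say how the job flow was created), and only the directory wiring mirrors the answer, with step 2 reading what step 1 wrote.

```python
# Hypothetical location of the job's JAR; the original post never gives one.
JAR = "s3://exdsyslab/jars/myjob.jar"

# One step per job: step 1 writes to .../result, step 2 reads .../result
# and writes to .../result1, matching the directories above.
steps = [
    {
        "Name": "step-1",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": JAR,
            "Args": [
                "-inputfile", "s3n://exdsyslab/data/file.txt",
                "-output", "s3n://exdsyslab/result",
            ],
        },
    },
    {
        "Name": "step-2",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": JAR,
            "Args": [
                "-input", "s3n://exdsyslab/result",
                "-output", "s3n://exdsyslab/result1",
            ],
        },
    },
]

# Actually submitting the steps needs AWS credentials and a running
# job flow, so it is left commented out here:
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=steps)
```

Because step 2's input is step 1's output directory, the chain avoids the "output folder already exists" error: each step writes to a fresh directory instead of reusing one.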