Tags: apache-spark, amazon-emr, amazon-data-pipeline

Data Pipeline failing for EMR Activity


I am trying to run a Spark step on AWS Data Pipeline. I am getting the following exception:

    amazonaws.datapipeline.taskrunner.TaskExecutionException: Failed to complete EMR transform.
        at amazonaws.datapipeline.activity.EmrActivity.runActivity(EmrActivity.java:67)
        at amazonaws.datapipeline.objects.AbstractActivity.run(AbstractActivity.java:16)
        at amazonaws.datapipeline.taskrunner.TaskPoller.executeRemoteRunner(TaskPoller.java:136)
        at amazonaws.datapipeline.taskrunner.TaskPoller.executeTask(TaskPoller.java:105)
        at amazonaws.datapipeline.taskrunner.TaskPoller$1.run(TaskPoller.java:81)
        at private.com.amazonaws.services.datapipeline.poller.PollWorker.executeWork(PollWorker.java:76)
        at private.com.amazonaws.services.datapipeline.poller.PollWorker.run(PollWorker.java:53)
        at java.lang.Thread.run(Thread.java:748)
    Caused by: amazonaws.datapipeline.taskrunner.TaskExecutionException: EMR job '@DefaultEmrActivity1_2017-11-20T12:13:08_Attempt=1' with jobFlowId 'j-2E7PU1OK3GIJI' is failed with status 'FAILED' and reason 'Cluster ready after last step completed.'. Step 'df-0693981356F3KEDFQ6GG_@DefaultEmrActivity1_2017-11-20T12:13:08_Attempt=1' is in status 'FAILED' with reason 'null'
        at amazonaws.datapipeline.cluster.EmrUtil.runSteps(EmrUtil.java:286)
        at amazonaws.datapipeline.activity.EmrActivity.runActivity(EmrActivity.java:63)
        ... 7 more

The cluster itself spins up correctly.

Here is a screenshot of the pipeline:

[screenshot of the pipeline]

I think there is some issue with the 'step' field in the activity. Any input would be helpful.


Solution

There were two issues:

1) The step script should be comma-separated, something like:

    command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.amazon.Main

Reference: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-emrcluster.html

2) EmrActivity does not support staging, so ${INPUT1_STAGING_DIR} cannot be used in the step instruction. For now, I have replaced it with hardcoded S3 URLs; a sketch of the resulting definition is shown after this list.
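For reference, here is a minimal sketch of what the corrected pipeline definition could look like. The application class is taken from the step above, but the jar path, bucket names, instance settings, and object ids are placeholders, not values from the original pipeline:

    {
      "objects": [
        {
          "id": "MyEmrCluster",
          "type": "EmrCluster",
          "releaseLabel": "emr-5.8.0",
          "masterInstanceType": "m3.xlarge",
          "coreInstanceCount": "1",
          "terminateAfter": "2 Hours"
        },
        {
          "id": "MyEmrActivity",
          "type": "EmrActivity",
          "runsOn": { "ref": "MyEmrCluster" },
          "step": "command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.amazon.Main,s3://my-bucket/jars/my-app.jar,s3://my-bucket/input/,s3://my-bucket/output/"
        }
      ]
    }

Note the two fixes: the "step" value is a single comma-separated string (commas, not spaces, delimit the arguments), and the input/output locations are hardcoded s3:// paths rather than ${INPUT1_STAGING_DIR}.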