
EMR activity using Data Pipeline for a Spark job


I am trying to run a jar file for a Spark job in Data Pipeline, but I am not sure what exactly I need to pass in the EMR step.


Solution

  • The EMR step is where you describe how you want to submit the Spark jar.

    When you create a new pipeline in Data Pipeline, you can choose the "build using template" option and then pick "run job on an Elastic MapReduce cluster".

    Then, in the EmrActivity, describe the step you want to submit (you can also run multiple steps if you want).

    You can read the AWS EMR Spark Step Guide to understand what a step is. In short, it is the place where you describe how to submit the Spark job.

    Note, though, that in Data Pipeline the step is written as a comma-separated list, so you need to replace the spaces in the command with ','. Here is an example of a Spark step I ran on Data Pipeline:

    command-runner.jar,spark-submit,--deploy-mode,cluster,--class,com.exelate.main.App,--master,yarn-cluster,--name,<spark job name>,--num-executors,1000,--driver-cores,2,--driver-memory,10g,--executor-memory,16g,--executor-cores,4,<jar location on s3>,<jar arguments>
    

    I left some of my configuration in so that you can see where each option goes, and I replaced the rest with placeholders in angle brackets, which you should swap for your own information.
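
If you would rather not replace the spaces by hand, the space-to-comma rule described above can be scripted. The following is only a minimal sketch in Python, not anything provided by AWS: `to_emr_step` is a hypothetical helper name, and the class name, S3 path, and job argument in the usage are made-up placeholders. It assumes that none of your arguments contains a comma, and it raises an error if one does rather than guessing at an escape.

```python
import shlex


def to_emr_step(command: str, runner: str = "command-runner.jar") -> str:
    """Turn a shell-style command into a comma-separated EmrActivity step.

    Hypothetical helper: it only joins the arguments with commas, prefixed
    by the runner jar, as in the example step in the answer.
    """
    # shlex.split honours quoting, so "--name 'my job'" stays one argument.
    args = shlex.split(command)
    for arg in args:
        # Assumption of this sketch: commas inside arguments are unsupported.
        if "," in arg:
            raise ValueError(f"argument contains a comma: {arg!r}")
    return ",".join([runner, *args])


# Placeholder class name, jar location, and job argument.
step = to_emr_step(
    "spark-submit --deploy-mode cluster --class com.example.Main "
    "--executor-memory 16g s3://my-bucket/jobs/app.jar 2024-01-01"
)
print(step)
```

The resulting string is what you would paste into the step field of the EmrActivity. If you run several steps, build one such string per step.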