Imagine an AWS Data Pipeline setup that contains only the following:
myEmrActivity1 and myEmrActivity2, each of which takes command-runner.jar, spark-submit, and a few other arguments such as the Python version to use. The arguments are different for each activity. So, for example, myEmrActivity1 runs a Spark job that calculates the total number of absences for a given year, and an example parameter for the EmrActivity for that job might be:
myEmrActivity1: command-runner.jar,spark-submit,--master,yarn-cluster,--deploy-mode,cluster,PYTHON=python36,s3://amznhadoopactivity/school-attendance-python36/calculate_attendance_for_year.py,2017
where 2017 indicates the year supplied to the Python script.
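For concreteness, here is a minimal sketch of roughly how that activity might appear in the pipeline definition JSON; the exact layout is only illustrative, and the EmrCluster reference name MyEmrCluster is just a placeholder for whatever cluster resource the activity runs on:

```
{
  "id": "myEmrActivity1",
  "name": "myEmrActivity1",
  "type": "EmrActivity",
  "runsOn": { "ref": "MyEmrCluster" },
  "step": "command-runner.jar,spark-submit,--master,yarn-cluster,--deploy-mode,cluster,PYTHON=python36,s3://amznhadoopactivity/school-attendance-python36/calculate_attendance_for_year.py,2017"
}
```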
However, the structure of HadoopActivity is a bit different than it is for EmrActivity. HadoopActivity takes a Jar URI that I've filled out with s3://dynamodb-emr-<region>/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar, with my region inserted - let's call that myHadoopActivity1. However, I don't understand how exactly to link a step to an activity like I did with the Parameters - how would I recreate the behavior I set up with the EmrActivity in Data Pipeline with a HadoopActivity object instead? Should I be using a different .jar file?
It turns out this was pretty easy to accomplish, albeit not obvious. First things first - I should have been using a different .jar URI: /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar. After changing that, the next steps were pretty straightforward. If you were running:
command-runner.jar,spark-submit,--master,yarn-cluster
as an EmrActivity step, just add a HadoopActivity, put the .jar mentioned above in as the Jar URI, then add each piece of the step as a separate argument to replicate the behavior of the EmrActivity "steps":
Argument: command-runner.jar
Argument: spark-submit
Argument: --master
Argument: yarn-cluster
and so on and so forth. So, not that difficult, but also not obvious. Hope this helps someone in the future.
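Put together, the resulting HadoopActivity object ends up looking roughly like the sketch below in the pipeline definition JSON. This is only illustrative, and MyEmrCluster is again a placeholder for the EmrCluster resource the activity runs on; the arguments are simply the pieces of the original EmrActivity step, listed in order:

```
{
  "id": "myHadoopActivity1",
  "name": "myHadoopActivity1",
  "type": "HadoopActivity",
  "runsOn": { "ref": "MyEmrCluster" },
  "jarUri": "/var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar",
  "argument": [
    "command-runner.jar",
    "spark-submit",
    "--master",
    "yarn-cluster",
    "--deploy-mode",
    "cluster",
    "PYTHON=python36",
    "s3://amznhadoopactivity/school-attendance-python36/calculate_attendance_for_year.py",
    "2017"
  ]
}
```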