amazon-web-services amazon-data-pipeline

Translating EmrActivity to HadoopActivity in AWS Data Pipeline


Imagine an AWS Data Pipeline setup that contains only the following:

  • 2 EmrActivities, myEmrActivity1 and myEmrActivity2, that each take command-runner.jar, spark-submit, and a few other arguments such as the Python version to use. The arguments are different for each activity.
  • 2 parameters, one for each EmrActivity

So, for example, myEmrActivity1 runs a Spark job that calculates the total number of absences for a given year, and an example parameter for that EmrActivity might be:

myEmrActivity1: command-runner.jar,spark-submit,--master,yarn-cluster,--deploy-mode,cluster,PYTHON=python36,s3://amznhadoopactivity/school-attendance-python36/calculate_attendance_for_year.py,2017

where 2017 indicates the year supplied to the Python script.

However, HadoopActivity is structured a bit differently from EmrActivity. HadoopActivity takes a Jar URI, which I've filled in with s3://dynamodb-emr-<region>/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar, with my region inserted; let's call that activity myHadoopActivity1. What I don't understand is how to link a step to the activity the way I did with the parameters. How would I recreate the EmrActivity behavior I set up in Data Pipeline with a HadoopActivity object instead? Should I be using a different .jar file?


Solution

  • It turns out this was pretty easy to accomplish, albeit not obvious. First, I should have been using a different .jar URI: /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar. After changing that, the remaining steps were straightforward. If you were running:

    command-runner.jar,spark-submit,--master,yarn-cluster
    

    as an EmrActivity step, add a HadoopActivity, set its Jar URI to the .jar mentioned above, and then add one argument per comma-separated token to replicate the behavior of the EmrActivity's "steps":

    Argument: command-runner.jar
    Argument: spark-submit
    Argument: --master
    Argument: yarn-cluster
    

    and so on for any remaining arguments. So it's not difficult, but it's also not obvious. Hope this helps someone in the future.
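For anyone who keeps the pipeline definition as JSON rather than editing it in the console, the translation above can be sketched in Python. This is only a minimal sketch under a few assumptions: the field names (jarUri, argument, runsOn) are the ones I understand the pipeline definition syntax to use for HadoopActivity, the helper name emr_step_to_hadoop_activity and the cluster id myEmrCluster are made up for illustration, and the naive comma split assumes no argument contains an escaped comma.

```python
import json

# Jar path quoted from the answer above.
COMMAND_RUNNER_JAR = "/var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar"


def emr_step_to_hadoop_activity(step, activity_id, cluster_id):
    """Build a HadoopActivity-style pipeline object from an EmrActivity step string.

    Hypothetical helper: it mirrors the manual translation described above,
    turning each comma-separated token of the step into one argument.
    """
    # Naive split: assumes no argument contains an escaped comma.
    arguments = [part.strip() for part in step.split(",")]
    return {
        "id": activity_id,
        "name": activity_id,
        "type": "HadoopActivity",
        "jarUri": COMMAND_RUNNER_JAR,
        "argument": arguments,
        # cluster_id must match the id of an EmrCluster object in the same pipeline.
        "runsOn": {"ref": cluster_id},
    }


if __name__ == "__main__":
    step = "command-runner.jar,spark-submit,--master,yarn-cluster"
    # "myEmrCluster" is a placeholder id, not something from the original setup.
    activity = emr_step_to_hadoop_activity(step, "myHadoopActivity1", "myEmrCluster")
    print(json.dumps(activity, indent=2))
```

The same function should work for the longer step from the question (the one ending in calculate_attendance_for_year.py,2017); each token simply becomes one more entry in the argument list.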