
EMR step execution using Airflow failed


I am using Airflow in a local environment to add a step to an EMR cluster after its creation, using the code block below.

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2-style import
from airflow.utils.dates import days_ago

# boto3 EMR client; credentials and region come from the local AWS config
client = boto3.client('emr')

def add_step(cluster_id, jar_file, step_args):
    """Add a single step to a running EMR cluster and return its step id."""
    print("The cluster id : {}".format(cluster_id))
    print("The step to be added : {}".format(step_args))
    response = client.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[
            {
                'Name': 'Test',
                'ActionOnFailure': 'CONTINUE',
                'HadoopJarStep': {
                    'Jar': jar_file,
                    'Args': step_args
                }
            },
        ]
    )
    print('EMR Step is added')
    return response['StepIds'][0]

dag = DAG(
    dag_id="EMR_START_DAG",
    description="Trial for EMR start",
    start_date=days_ago(1)
)

EMR_STEP_1 = PythonOperator(
    task_id='EMR_STEP_1',
    python_callable=add_step,
    op_kwargs={'cluster_id': '{{ ti.xcom_pull("EMR_START")["JobFlowId"] }}',
               'jar_file': 'command-runner.jar',
               'step_args': ['s3://shell script path']},
    dag=dag
)

The step gets added to the EMR cluster successfully, but it fails with the error below.

Exception in thread "main" java.lang.RuntimeException: java.io.IOException: Cannot run program "s3://shell script path" (in directory "."): error=2, No such file or directory
    at com.amazonaws.emr.command.runner.ProcessRunner.exec(ProcessRunner.java:140)
    at com.amazonaws.emr.command.runner.CommandRunner.main(CommandRunner.java:23)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:244)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
Caused by: java.io.IOException: Cannot run program "s3://shell script path" (in directory "."): error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at com.amazonaws.emr.command.runner.ProcessRunner.exec(ProcessRunner.java:93)
    ... 7 more
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 8 more

The file exists at that path, and both the EMR instance profile and the service role have full access to S3. I am not sure what is going wrong.

The shell script contains the following command:

wget -O - https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data | aws s3 cp - s3path

Solution

  • When you run a shell script on AWS EMR, use script-runner.jar instead of command-runner.jar. See the page Run commands and scripts on an Amazon EMR cluster and the quote below for more information; a corrected version of your Airflow task is sketched after the quoted example.

    Although in your particular case you may not need to run it on EMR at all. Since all the script does is copy an external dataset to S3 using the AWS CLI, you can execute the same CLI command manually on your laptop once if the dataset does not change, or otherwise pre-install the AWS CLI on the Airflow server and run the script in a BashOperator (a sketch follows the quoted example below). This approach avoids creating and running a potentially expensive EMR cluster for a command that does not actually use the cluster's resources.

    Example: Running a script on a cluster using script-runner.jar

    When you use script-runner.jar, you specify the script that you want to run in your step's list of arguments.

    The following AWS CLI example submits a step to a running cluster that invokes script-runner.jar. In this case, the script called my-script.sh is stored on Amazon S3. You can also specify local scripts that are stored on the master node of your cluster.

    aws emr add-steps \
    --cluster-id j-2AXXXXXXGAPLF \
    --steps Type=CUSTOM_JAR,Name="Run a script from S3 with script-runner.jar",ActionOnFailure=CONTINUE,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://EXAMPLE-DOC-BUCKET/my-script.sh]
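
    For reference, here is a minimal sketch of the corrected Airflow task, reusing the add_step function from the question. The region in the script-runner.jar path and the script's bucket/key are placeholders: script-runner.jar is served from a per-region EMR bucket, so the path must match the region your cluster runs in.

    EMR_STEP_1 = PythonOperator(
        task_id='EMR_STEP_1',
        python_callable=add_step,
        op_kwargs={
            'cluster_id': '{{ ti.xcom_pull("EMR_START")["JobFlowId"] }}',
            # script-runner.jar lives in a per-region EMR bucket; the region
            # below is a placeholder and must match your cluster's region
            'jar_file': 's3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar',
            # the first step argument is the S3 location of the script to run
            'step_args': ['s3://your-bucket/path/to/script.sh'],
        },
        dag=dag
    )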
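
    And if you take the Airflow-only route instead, a minimal BashOperator sketch could look like the following. It assumes an Airflow 2 import path, the AWS CLI installed on the Airflow worker, and a placeholder destination bucket.

    from airflow.operators.bash import BashOperator

    # Download the dataset and stream it straight to S3, no EMR cluster needed.
    # The task_id and the destination S3 path are placeholders.
    copy_iris_to_s3 = BashOperator(
        task_id='copy_iris_to_s3',
        bash_command=(
            'wget -O - https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
            ' | aws s3 cp - s3://your-bucket/iris/iris.data'
        ),
        dag=dag
    )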