apache-spark, pyspark, emr

Running Python Spark on EMR


We're having a hard time running a Python Spark job on EMR.

aws emr add-steps --cluster-id j-XXXXXXXX --steps \
Type=CUSTOM_JAR,Name="Spark Program",\
Jar="command-runner.jar",ActionOnFailure=CONTINUE,\
Args=["spark-submit","--deploy-mode","cluster","--master","yarn","s3://XXXXXXX/pi.py","2"]

We're running the same PySpark compute-pi script that the AWS documentation suggests.
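
For reference, that script is essentially the stock pi.py shipped with the Spark examples. A sketch of the SparkContext-based version (the API current for Spark of that era; not copied from the question itself):

# pi.py -- the standard Spark "compute pi" example (sketch)
import sys
from random import random
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonPi")
    # The number of partitions comes from the first CLI argument
    # (the trailing "2" in the add-steps command above)
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def inside(_):
        # Sample a point in the 2x2 square centered on the origin;
        # count it if it lands inside the unit circle
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 < 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()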

The script runs, but it keeps calculating pi forever. On a local machine it finishes in seconds. We've tried client deploy mode as well, but client mode forces us to transfer the files to the master node.

16/09/20 15:20:32 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1474384831795
     final status: UNDEFINED
     tracking URL: http://XXXXXXX.ec2.internal:20888/proxy/application_1474381572045_0002/
     user: hadoop
16/09/20 15:20:33 INFO Client: Application report for application_1474381572045_0002 (state: ACCEPTED)
This last application report line repeats over and over...

Does anyone know how to run the example Python Spark pi script on EMR without it running forever?


Solution

  • When a job sits in the ACCEPTED state forever, it is not actually running; it is waiting for YARN to have enough free resources to start the application. Usually that means some other YARN application is already running and holding those resources. The easiest way to check is to look at the YARN ResourceManager UI on port 8088 of the master node. You can also run the command "yarn application -list" after SSHing to the master node, as in the sketch below.
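
For example (the key path, master-node DNS name, and application ID below are placeholders, not values from the question):

# From your workstation: tunnel the YARN ResourceManager UI (port 8088 on the
# master node) so it is reachable at http://localhost:8088
ssh -i ~/mykey.pem -N -L 8088:localhost:8088 hadoop@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

# On the master node: see what is queued or running
yarn application -list

# If another application is holding the cluster's resources, free them up
# (this application ID is hypothetical)
yarn application -kill application_1474381572045_0001

Once no other application is occupying the queue, the pi step should move from ACCEPTED to RUNNING and finish quickly.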