I can submit a Spark job on AWS EMR with the following command. How do I fetch the stdout of the Spark job?
aws emr add-steps --cluster-id ${CLUSTERID} \
--output json \
--steps Type=spark,Name=${JOB_NAME},Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,--conf,spark.eventLog.enabled=true,--num-executors,1,--executor-cores,1,--executor-memory,2g,--conf,spark.eventLog.dir=${LOG_DIR},s3://bucket/${FILE_NAME},s3://bucket/${FILE_NAME}],ActionOnFailure=CONTINUE
You have a few options when it comes to viewing the stdout of Spark jobs on EMR. You can see details in the EMR docs on viewing log files either on the cluster primary node or on S3.
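If you're able to SSH to the cluster primary node, YARN log aggregation is usually the quickest way to see the driver's stdout for a cluster-mode job. A minimal sketch (run on the primary node; get the application ID from the first command):

# List YARN applications on the cluster to find your job's application ID
yarn application -list -appStates ALL

# Print the aggregated container logs, which include the driver's stdout in cluster mode
yarn logs -applicationId <application-id> | less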
If you want to view the output programmatically without connecting to the EMR primary node itself, you need to ensure that S3 log archiving is enabled for your cluster and that you have access to that S3 location. There are two options.
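Both options below build S3 paths from your cluster ID and the step ID, so it helps to check the cluster's log location and capture the step ID when you submit. A small sketch, where STEP_DEFINITION is a hypothetical variable holding the same --steps value as in your add-steps command:

# Check where (and whether) the cluster archives logs to S3
aws emr describe-cluster --cluster-id ${CLUSTER_ID} --query 'Cluster.LogUri' --output text

# Capture the ID of the step you just submitted; add-steps returns {"StepIds": ["s-..."]}
STEP_ID=$(aws emr add-steps --cluster-id ${CLUSTER_ID} \
  --steps "${STEP_DEFINITION}" \
  --query 'StepIds[0]' --output text)
echo "Submitted step ${STEP_ID}"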
The first option is to submit your job in client deploy mode and view the step logs in S3 once they're synced. If you configured your EMR cluster to send logs to s3://${S3_BUCKET}/logs/emr, for example, the step logs will be in s3://${S3_BUCKET}/logs/emr/${CLUSTER_ID}/steps/${STEP_ID}/. You can then copy the stdout.gz to your system and gunzip it:

aws s3 cp s3://${S3_BUCKET}/logs/emr/${CLUSTER_ID}/steps/${STEP_ID}/stdout.gz - | gunzip
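Note that step logs are only pushed to S3 periodically (typically every few minutes), so if you script this right after the step finishes, you may want to poll until the object shows up. A rough sketch:

# Wait for the step's stdout.gz to be synced to S3, then print it
until aws s3 ls s3://${S3_BUCKET}/logs/emr/${CLUSTER_ID}/steps/${STEP_ID}/stdout.gz > /dev/null 2>&1; do
  echo "Waiting for step logs to sync to S3..."
  sleep 30
done
aws s3 cp s3://${S3_BUCKET}/logs/emr/${CLUSTER_ID}/steps/${STEP_ID}/stdout.gz - | gunzip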
The second option is to keep your job in cluster deploy mode, which is a little more difficult: you have to find the YARN application ID in the step logs and then look at the YARN container logs that are sent to S3. This is still doable, you just have a couple of extra steps, and the container logs take a little longer to show up in S3.

# Parse the step log for the YARN application ID, keeping only the numeric part (e.g. 1234567890_0001)
YARN_APP_ID=$(aws s3 cp s3://${S3_BUCKET}/logs/emr/${CLUSTER_ID}/steps/${STEP_ID}/stderr.gz - | gunzip | grep "Submitting application application" | grep -oE "application_[0-9]+_[0-9]+" | head -n 1 | sed 's/^application_//')
# Copy the YARN application stdout log (once it's available)
aws s3 cp s3://${S3_BUCKET}/logs/emr/${CLUSTER_ID}/containers/application_${YARN_APP_ID}/container_${YARN_APP_ID}_01_000001/stdout.gz - | gunzip
The second command assumes that the driver ran in the first container of the first application attempt and that the job did not fail and get retried, so it may need to be tweaked.
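If that assumption doesn't hold, for example because the application attempt was retried, you can list the container log directories for the application and pick the right one:

# List the container log directories; the driver runs in the ..._000001 container
# of the attempt that actually ran your job
aws s3 ls s3://${S3_BUCKET}/logs/emr/${CLUSTER_ID}/containers/application_${YARN_APP_ID}/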
If you want to view logs on the EMR console, there are also a few options.
If you run your job in client mode and you have S3 logging enabled, you can view the step stdout logs in the Steps tab of your EMR cluster. If you haven't enabled S3 logging, you can use the hosted application UIs.
On-cluster application UIs - While the cluster is running, you can access the Spark UI from the Applications tab for your cluster. This launches the Spark History Server; navigate to your job, open the Executors tab, and use the stdout link in the Logs column for the driver.
Persistent application UIs - Both while the cluster is running and for 30 days after it's terminated, you can still access the Spark UI from the Applications tab for your cluster by selecting the Persistent application UIs option.
These last two options work the same way regardless of whether you've selected client or cluster deploy mode.
Thanks to this post, I just pushed a change to the Amazon EMR CLI that makes something like this pretty easy for a simple PySpark script. With the new --show-stdout flag, you can run a local PySpark script on EMR like this and it'll show you the stdout.
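If you don't already have the EMR CLI installed, it's a Python package; at the time of writing the package name is assumed to be emr-cli on PyPI:

# Install the Amazon EMR CLI (package name assumed; check the amazon-emr-cli repo if it changes)
python3 -m pip install emr-cli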
emr run \
--entry-point entrypoint.py \
--cluster-id ${CLUSTER_ID} \
--s3-code-uri s3://${S3_BUCKET}/pyspark-code/ \
--wait --show-stdout