Tags: apache-spark, pyspark, hadoop-yarn, amazon-emr

Determine where a Spark program is failing?


Is there any way to debug a Spark application that is running in cluster mode? I have a program that has been running successfully for a while, processing a couple hundred GB at a time. Recently, some data caused the run to fail due to executors being disconnected. From what I have read, this is likely a memory issue. I'm trying to determine which function/action is triggering the memory problem. I am using Spark on an EMR cluster (which uses YARN); what would be the best way to debug this issue?


Solution

  • For cluster mode, go to the YARN Resource Manager UI and select the Tracking UI link for your running application (it points to the Spark driver running in the Application Master on a YARN Node Manager). This opens the Spark UI, which is the core developer interface for debugging Spark apps. Since the Spark UI organizes work into jobs and stages, labeling the jobs your code submits makes it much easier to trace a failure back to a specific function, as shown in the sketch below.
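    A minimal PySpark sketch of that labeling (the app name, group IDs, and paths below are hypothetical placeholders, not from the original question):

        from pyspark.sql import SparkSession

        # A descriptive appName makes the application easy to find in the
        # YARN Resource Manager's application list.
        spark = (SparkSession.builder
                 .appName("nightly-aggregation")   # hypothetical name
                 .getOrCreate())
        sc = spark.sparkContext

        # setJobGroup labels every job submitted afterwards, so the Spark
        # UI's Jobs page maps back to the phase of your code that ran it.
        sc.setJobGroup("load", "read raw input")
        df = spark.read.parquet("s3://my-bucket/input/")   # placeholder path

        sc.setJobGroup("aggregate", "group and count records")
        df.groupBy("key").count().write.parquet("s3://my-bucket/output/")   # placeholder path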

    For client mode, you can also go to the YARN RM UI as mentioned above, or reach the Spark UI directly at http://[driverHostname]:4040, where driverHostname is the master node in EMR and 4040 is the default port (this can be changed, as sketched below).
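    If the default port is taken or blocked, spark.ui.port sets it explicitly. A minimal sketch, assuming you pick the port at session creation (4050 is an arbitrary example):

        from pyspark.sql import SparkSession

        # spark.ui.port controls where the live Spark UI binds; 4040 is the
        # default, and Spark tries 4041, 4042, ... if the port is occupied.
        spark = (SparkSession.builder
                 .config("spark.ui.port", "4050")   # arbitrary example port
                 .getOrCreate())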

    Additionally, you can review submitted and completed Spark apps via the Spark History Server at its default address: http://master-public-dns-name:18080/
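    The History Server only lists applications that wrote event logs. EMR enables event logging by default, but if you manage the configuration yourself it looks roughly like this (the log directory below is a placeholder, not EMR's guaranteed default):

        from pyspark.sql import SparkSession

        # Event logs are what the Spark History Server replays; without
        # them, a finished application will not appear at port 18080.
        spark = (SparkSession.builder
                 .config("spark.eventLog.enabled", "true")
                 .config("spark.eventLog.dir", "hdfs:///var/log/spark/apps")   # placeholder dir
                 .getOrCreate())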

    These are the essential resources, with the Spark UI being the main toolkit for your request.

    https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html

    https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-webui.html