I'm running a Databricks job that spawns a few thousand Spark jobs and stages. When I open the Spark UI after the cluster has terminated, it shows only the most recent jobs/stages instead of all of them (2002 in this example).
During my investigation I found and set the Spark conf below, but still no luck, as seen in the attached photo (taken with this conf applied).
spark.sql.ui.retainedExecutions 10000
spark.ui.retainedTasks 1000000
spark.ui.retainedStages 10000
spark.ui.retainedJobs 10000
Is there a way to keep all historical jobs/stages/tasks?
If not, how can one successfully debug a process after cluster termination?
After some research I understood this is a known issue, but there is a workaround that lets you view and investigate the full set of jobs/stages after the cluster is terminated.
To apply the workaround, do the following:
* For simplicity, I'll call the large process whose Spark UI we want to explore the "Large Process".
- Confirm cluster logs exist - Make sure log delivery is configured for the compute cluster running the "Large Process" (in the compute's advanced settings). All you need to define is the storage type (S3 in the provided example) and the path. The path you provide is the root path for all logs; each cluster run creates its own unique subpath.
- Review the cluster log path - On the terminated cluster (of the "Large Process"), under "Logging" in the configuration tab, you should now see the exact path the logs were written to. Make sure files exist in that path.
- To view the full Spark History Server, launch a single node cluster using DBR 9.1 (don't pick a newer version, as it would most likely be incompatible). You will replay the logs on this cluster.
Select the instance type based on the size of the event logs that you want to replay.
- Clone the Event Log Replay notebook.
- Attach the Event Log Replay notebook to the single node cluster.
Enter the path to your chosen cluster's event logs (the ones from step 2) in the event_log_path field in the notebook. Make sure to enter the exact path, as seen in the photo below.
- Check the Spark UI of the notebook's cluster (the single node cluster) to view the full Spark UI of the terminated "Large Process" run.
- Should you still see missing stages (on processes running a very large number of jobs/stages), add the following Spark config to the single node cluster (you might also need a larger node type, as this consumes more memory). Restart the single node cluster, run the replay notebook again, and you should see all of the stages.
spark.ui.retainedTasks 10000000
spark.ui.retainedJobs 1000000
spark.ui.retainedStages 10000000
spark.sql.ui.retainedExecutions 1000
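If you define the replay cluster through a cluster JSON spec instead of the UI, the same settings would go into its `spark_conf` map as string key/value pairs. Below is a minimal sketch that converts the "key value" lines above into such a map. The helper name `parse_spark_conf` is my own and purely illustrative; it is not part of any Databricks or Spark API.

```python
def parse_spark_conf(text):
    """Turn 'key value' lines (as pasted into the cluster's Spark config box)
    into a dict of strings. Blank lines are skipped."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace: key first, the rest is the value.
        key, value = line.split(None, 1)
        conf[key] = value.strip()
    return conf


# The settings from the answer above, copied verbatim.
REPLAY_CONF_TEXT = """
spark.ui.retainedTasks 10000000
spark.ui.retainedJobs 1000000
spark.ui.retainedStages 10000000
spark.sql.ui.retainedExecutions 1000
"""

replay_conf = parse_spark_conf(REPLAY_CONF_TEXT)

if __name__ == "__main__":
    for key, value in replay_conf.items():
        print(f"{key} = {value}")
```

Values are kept as strings on purpose, since Spark conf entries are passed around as strings.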
Read more here.
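For the "make sure files exist in the path" check, and to get a feel for how large an instance the replay cluster needs, something like the following can help. It is only a sketch under two assumptions of mine: that the log destination is reachable as a local directory (for example through a mount; for a plain S3 destination you would list the objects with your S3 tooling instead), and that event log file names start with `eventlog`. The root path in the usage section is hypothetical.

```python
import os


def find_event_logs(root):
    """Walk `root` and return a sorted list of (path, size_in_bytes) for every
    file whose name starts with 'eventlog' (assumed naming, adjust if needed)."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("eventlog"):
                full_path = os.path.join(dirpath, name)
                found.append((full_path, os.path.getsize(full_path)))
    return sorted(found)


def total_size_mb(logs):
    """Sum the sizes returned by find_event_logs, in megabytes."""
    return sum(size for _path, size in logs) / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical location: replace with the path shown under "Logging".
    log_root = "/dbfs/cluster-logs/my-cluster-id"
    logs = find_event_logs(log_root)
    print(f"{len(logs)} event log file(s), {total_size_mb(logs):.1f} MB in total")
    for path, size in logs:
        print(f"  {size:>12} bytes  {path}")
```

If this reports zero files, log delivery was probably not configured before the "Large Process" ran, and there is nothing to replay.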