I'm using `SparkAppHandle.Listener` to monitor the state of a PySpark application submitted with `SparkLauncher`. When the program fails, I expect the following state transitions:
Connected -> Submitting -> Running -> Failed
However, the actual state transitions I observe are:
Connected -> Submitting -> Running -> **Finished** -> Failed
Additionally, when I submit a pure Python script, it immediately transitions to the Lost state.
Questions:
- Is it expected to see a Finished state before a Failed state? Under what conditions could this happen?
- Why does a pure Python script result in a Lost state immediately? What should I check in my script or cluster configuration to resolve this?
I have implemented a `SparkAppHandle.Listener` to capture and print state changes of the Spark job, and I have reviewed the Spark logs to understand the sequence of events leading to these transitions.
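For reference, my listener boils down to the following. The `State` and `Listener` types here are stand-ins so the snippet is self-contained; in the real code the listener is registered by passing it to `SparkLauncher.startApplication(listener)`:

```java
import java.util.ArrayList;
import java.util.List;

public class StateLogger {
    // Stand-in for SparkAppHandle.State (same state names).
    enum State { UNKNOWN, CONNECTED, SUBMITTED, RUNNING, FINISHED, FAILED, KILLED, LOST }

    // Stand-in for SparkAppHandle.Listener.
    interface Listener {
        void stateChanged(State newState);
    }

    // Records and prints every transition, like my real listener does.
    static class LoggingListener implements Listener {
        final List<State> transitions = new ArrayList<>();

        @Override
        public void stateChanged(State newState) {
            transitions.add(newState);
            System.out.println("state -> " + newState);
        }
    }

    public static void main(String[] args) {
        LoggingListener listener = new LoggingListener();
        // Replaying the sequence I observe when the job fails:
        for (State s : new State[]{State.CONNECTED, State.SUBMITTED,
                                   State.RUNNING, State.FINISHED, State.FAILED}) {
            listener.stateChanged(s);
        }
    }
}
```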
Yes, it is expected to see a FINISHED state before FAILED in some cases. For example, when running Spark on YARN in client mode, the job-monitoring loop ends before the final ApplicationMaster (AM) state is known, so a FINISHED state is reported first: lacking knowledge of the actual AM state, the monitor assumes the job has finished normally. Then, once the YARN job has finished, the YARN client in the launcher polls the final AM state from the job report and sends a new state update to the launch server (the part of `SparkLauncher` that listens for events from the Spark application, including state changes).
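That two-step flip can be sketched with a toy model. The `State` and `Listener` types below are simplified stand-ins, not Spark's real API; the comments mark where the real monitor and YARN client act:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the YARN client-mode behavior described above: the monitor
// optimistically reports FINISHED when its loop ends, and the real final
// state arrives afterwards from the YARN application report.
public class FinishedThenFailed {
    enum State { CONNECTED, SUBMITTED, RUNNING, FINISHED, FAILED }

    interface Listener { void stateChanged(State s); }

    static List<State> simulateFailedYarnClientJob() {
        List<State> seen = new ArrayList<>();
        Listener listener = seen::add;

        listener.stateChanged(State.CONNECTED);
        listener.stateChanged(State.SUBMITTED);
        listener.stateChanged(State.RUNNING);

        // Monitoring loop ends before the final AM state is known:
        // the monitor assumes the job completed normally.
        listener.stateChanged(State.FINISHED);

        // The YARN client later polls the job report and forwards the
        // actual final AM state to the launch server.
        State finalAmState = State.FAILED;
        listener.stateChanged(finalAmState);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(simulateFailedYarnClientJob());
        // prints [CONNECTED, SUBMITTED, RUNNING, FINISHED, FAILED]
    }
}
```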
A pure Python script exits without ever talking properly to the launch server, and hence the state becomes LOST.
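A toy model of that mechanism, with a latch standing in for the connection back to the launch server (illustrative only, not Spark's implementation):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// The launch server expects the launched application to connect back and
// report states. If the child exits without ever connecting, the handle
// can only mark the application as LOST.
public class LostDemo {
    enum State { CONNECTED, LOST }

    static State monitor(Runnable child, CountDownLatch connectedBack) {
        Thread t = new Thread(child);
        t.start();
        try {
            t.join(); // the child process has exited
            // Did it ever connect back to the launch server?
            return connectedBack.await(100, TimeUnit.MILLISECONDS)
                    ? State.CONNECTED : State.LOST;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return State.LOST;
        }
    }

    public static void main(String[] args) {
        // A pure Python script: does its work and exits, never connecting.
        CountDownLatch never = new CountDownLatch(1);
        System.out.println(monitor(() -> {}, never)); // LOST

        // A proper Spark application: connects back before exiting.
        CountDownLatch once = new CountDownLatch(1);
        System.out.println(monitor(once::countDown, once)); // CONNECTED
    }
}
```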