Search code examples
apache-sparkapache-spark-standalone

How to make Spark driver resilient to Master restarts?


I have a Spark Standalone (not YARN/Mesos) cluster and a driver app running (in client mode), which talks to that cluster to execute its tasks. However, if I shutdown and restart the Spark master and workers, the driver does not reconnect to the master and resume its work.

Perhaps I am confused about the relationship between the Spark Master and the driver. In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?


Solution

  • In a situation like this, is the Master responsible for reconnecting back to the driver? If so, does the Master serialize its current state to disk somewhere that it can restore on restart?

    The relationship between the Master node and the driver depends on a few factors. First, the driver is the one hosting your SparkContext/StreamingContext and is the in charge of the jobs execution. It is the one that creates the DAG, and holds the DAGScheduler and TaskScheduler which assign stages/tasks respectively. The Master Node may serve as the host for the driver in case you use Spark Standalone and run your job in "Client Mode". That way, the Master also hosts the driver process and if it dies the driver dies as with it. In case "Cluster mode" is used, the driver resides on one of the Worker nodes, and communicates with the Master frequently to get the status of the current running job, send back metadata regarding the status of the completed batches, etc.

    Running on Standalone, if the Master dies and you restart it, the Master does not re-execute the jobs that were previously ran. In order to achieve this, you can create and provide the cluster with an additional Master node, and set it up so ZooKeeper can hold the Masters state, and interchange between the two in case of failure. When you set up the cluster in such a way, the Master knows about it's previously executed jobs and resumes them on your behalf the new Master has taken the lead.

    You can read how to create a standby Spark Master node in the documentation.