Tags: apache-spark, spark-streaming, high-availability, monit, fault-tolerance

How to automatically restart a failed node in Spark Streaming?


I'm using Spark on a cluster in Standalone mode.

I'm currently working on a Spark Streaming application. I've added checkpointing so the system can recover if the driver process fails unexpectedly, and I see that it's working well.
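For context, driver-failure recovery in Spark Streaming is usually done with `StreamingContext.getOrCreate`, which rebuilds the context from checkpoint data on restart. Here is a minimal sketch; the checkpoint path and app name are placeholders, and the actual DStream logic is elided:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Checkpoint directory on a fault-tolerant filesystem (HDFS path is an assumption)
val checkpointDir = "hdfs:///spark/checkpoints/my-streaming-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("MyStreamingApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define your input DStreams and transformations here ...
  ssc
}

// On a fresh start this calls createContext(); after a driver failure it
// reconstructs the context (and pending batches) from the checkpoint data.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

Note that all stream setup must happen inside the factory function, not after `getOrCreate`, or it won't be recovered from the checkpoint.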

My question is: what happens if an entire node crashes (power failure, hardware error, etc.)? Is there a way to automatically detect failed nodes in the cluster and, if so, restart them on the same machine (or on a different machine instead)?

I've looked at monit, but it seems to run on a specific machine and restarts failing processes there, whereas I need to do the same thing across nodes. To be clear, I don't mind if the restart takes a little time, but I'd prefer it to happen automatically.

Is there any way to do this?

Thanks in advance


Solution

  • Spark Standalone has some support for high availability, as described in the official documentation, at least for the master node (via ZooKeeper-backed standby masters).

    When a worker node dies, Spark reschedules its tasks on the remaining nodes, and this works reasonably well for Spark Streaming too.

    Beyond that, you need external cluster-management and monitoring tools to detect dead machines and bring them (or replacements) back up.
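    Two Spark-side settings are worth knowing here. ZooKeeper-based master recovery is enabled through `spark.deploy.recoveryMode` in `spark-env.sh`, and in standalone cluster mode the `--supervise` flag makes the cluster restart a driver that exits with a non-zero code. A sketch, with the ZooKeeper hosts and application details as placeholders:

    ```shell
    # conf/spark-env.sh on every master: enable ZooKeeper-based recovery
    # (zk1/zk2 hostnames are assumptions; use your own ensemble)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

    # Submit the streaming app in cluster mode with driver supervision,
    # so the standalone cluster restarts the driver if it dies
    ./bin/spark-submit \
      --master spark://master1:7077,master2:7077 \
      --deploy-mode cluster \
      --supervise \
      --class com.example.MyStreamingApp \
      my-streaming-app.jar
    ```

    Note this restarts the master and driver processes, not a dead machine; rebooting or replacing failed hardware still needs something external (e.g. your cloud provider's auto-recovery or a cluster manager).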