I have a Spark job running on Kubernetes using the spark-on-k8s-operator. The job usually takes less than 5 minutes to complete, but sometimes it gets stuck because of lost executors, a problem I'm still investigating.
How can I specify a timeout in Spark so that the driver kills all the executors and itself if the execution exceeds that timeout?
spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout
The timeout in seconds to wait to acquire a new executor and schedule a task before aborting a TaskSet which is unschedulable because all executors are excluded due to task failures.
from https://spark.apache.org/docs/latest/configuration.html
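For what it's worth, here is a minimal sketch of how that option could be set when the session is created (assuming a PySpark entry point; the app name and the 300s value are just placeholders for illustration, not recommendations):

```python
from pyspark.sql import SparkSession

# Hypothetical example: set the unschedulable-TaskSet timeout when building
# the session, so the TaskSet is aborted if no executor can be acquired in time.
spark = (
    SparkSession.builder
    .appName("my-spark-job")  # placeholder app name
    .config("spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout", "300s")
    .getOrCreate()
)
```

The same key should also be accepted as a `--conf` argument to spark-submit, since it is a regular Spark configuration property.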
As far as I'm aware, the Spark Helm chart doesn't offer the spark.scheduler.excludeOnFailure.unschedulableTaskSetTimeout configuration option.