How can I increase failure tolerance on YARN? On a busy cluster my job fails because it hits too many failures. Most of the failures were lost executors
caused by preemption.
If you have preemption enabled you really should be using the external shuffle service to avoid these issues. There's really not much that can be done otherwise.
This JIRA discusses the issue: https://issues.apache.org/jira/browse/SPARK-14209
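
For reference, here is a rough sketch of turning on the external shuffle service from the application side. It only assembles a spark-submit command line with `spark.shuffle.service.enabled=true`. The helper `build_submit_command` and the application name `my_job.py` are made up for illustration, and it assumes the job is submitted with `--master yarn`.

```python
import shlex

# Spark property that tells executors to serve shuffle files through the
# external shuffle service instead of from the executor process itself.
SHUFFLE_SERVICE_CONF = {
    "spark.shuffle.service.enabled": "true",
}


def build_submit_command(app, conf):
    """Return a spark-submit argument list (hypothetical helper).

    `app` is the application file to submit; `conf` maps Spark property
    names to their values, each passed as a separate --conf flag.
    """
    cmd = ["spark-submit", "--master", "yarn"]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # "my_job.py" is a placeholder application name.
    print(shlex.join(build_submit_command("my_job.py", SHUFFLE_SERVICE_CONF)))
```

Setting the property on the application is only half of it. The YARN NodeManagers also have to run the shuffle service itself (the `spark_shuffle` aux-service backed by `org.apache.spark.network.yarn.YarnShuffleService` in yarn-site.xml), which usually needs a cluster admin.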