I just found that with Amazon's Elastic MapReduce, I can give a step one of three ActionOnFailure choices: TERMINATE_JOB_FLOW, CANCEL_AND_WAIT, or CONTINUE.
TERMINATE_JOB_FLOW is the default and obvious - it shuts down the entire cluster upon a failure in the step.
What is the difference between CANCEL_AND_WAIT and CONTINUE? It appears to me that both will keep the cluster running and simply move on to the next step when it is added.
Say you have launched a cluster and added the following three steps to it, to run in order: Step1, Step2, and Step3.
Now, if Step1 has ActionOnFailure set to CANCEL_AND_WAIT, then in the event of Step1 failing, it would cancel all the remaining steps and the cluster would go into the Waiting status. And I guess if you launch your cluster with the --stay-alive option, then this is the default behaviour.
If Step1 has ActionOnFailure set to CONTINUE, then in the event of Step1 failing, it would continue with the execution of Step2.
If Step1 has ActionOnFailure set to TERMINATE_JOB_FLOW, then in the event of Step1 failing, it would shut down the cluster, as you mentioned.
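To make the three cases easier to compare, here is a small toy model in Python. It does not talk to EMR at all; the function name run_steps, the step tuples, and the simplified state labels (COMPLETED, FAILED, CANCELLED, WAITING, TERMINATED) are my own assumptions for illustration, and it assumes a cluster that was launched to stay alive.

```python
from typing import List, Tuple

# (name, action_on_failure, succeeds) -- all values here are hypothetical.
Step = Tuple[str, str, bool]

VALID_ACTIONS = {"TERMINATE_JOB_FLOW", "CANCEL_AND_WAIT", "CONTINUE"}


def run_steps(steps: List[Step]) -> Tuple[List[Tuple[str, str]], str]:
    """Toy model: return (per-step outcomes, final cluster state)."""
    for _, action, _ in steps:
        if action not in VALID_ACTIONS:
            raise ValueError(f"unknown ActionOnFailure: {action}")

    outcomes = []
    cluster = "WAITING"  # assumption: the cluster was launched to stay alive
    for i, (name, action, succeeds) in enumerate(steps):
        if succeeds:
            outcomes.append((name, "COMPLETED"))
            continue
        outcomes.append((name, "FAILED"))
        if action == "CONTINUE":
            continue  # carry on with the next step
        # CANCEL_AND_WAIT and TERMINATE_JOB_FLOW both stop here, so the
        # steps that have not started yet never run.
        outcomes.extend((later, "CANCELLED") for later, _, _ in steps[i + 1:])
        if action == "TERMINATE_JOB_FLOW":
            cluster = "TERMINATED"  # simplified label for a shut-down cluster
        break
    return outcomes, cluster


if __name__ == "__main__":
    for action in ("CANCEL_AND_WAIT", "CONTINUE", "TERMINATE_JOB_FLOW"):
        steps = [("Step1", action, False), ("Step2", action, True), ("Step3", action, True)]
        print(action, run_steps(steps))
```

So in this sketch the difference you asked about comes down to the steps that were already queued: with CONTINUE they still run after the failure, while with CANCEL_AND_WAIT they are cancelled and the cluster just sits waiting for new steps.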