Right now I have an AWS Step Function to create, run, and terminate EMR cluster jobs. I want to add a timeout feature to stop the job and terminate the cluster in the case that a cluster gets stuck or is taking too long to run (e.g. have an input variable "TIMEOUT_AFTER_X_HOURS": 12
passed into the state machine along with the cluster configs which will automatically stop the job and kill the cluster if it still running after 12 hours). Does anyone know how to accomplish this?
Unfortunately you can't dynamically specify the timeout for a state, but you can dynamically tell a Wait state how long it should wait. With that said, I would recommend that you use a Parallel State with two branches and a catch block. The first branch contains a Wait State and a Fail State (your timeout). The other branch contains your normal State Machine logic and a Fail State.
Whenever a branch fails inside a Parallel state, it aborts all running states in the other branches. Luckily you are able to catch these errors in the Parallel State and redirect it to another state depending on which branch failed. Heres an example of what I mean (change the values in the HardCodedInputs state to control which branch fails).
{
"StartAt": "HardCodedInputs",
"States": {
"HardCodedInputs": {
"Type": "Pass",
"Parameters": {
"WaitBranchInput": {
"timeout": 5,
"Comment": "Change the value of timeout"
},
"WorkerBranchInput": {
"SecondsPath": 3,
"Comment": "SecondsPath is used for testing purposes to simulate how long the worker will run"
}
},
"Next": "Parallel"
},
"Parallel": {
"Type": "Parallel",
"End": true,
"Catch": [{
"ErrorEquals": ["TimeoutExpired"],
"ResultPath": "$.ParralelStateOutput",
"Next": "ExecuteIfTimedOut"
}, {
"ErrorEquals": ["WorkerSuccess"],
"ResultPath": "$.ParralelStateOutput",
"Next": "ExecuteIfWorkerSuccesfull"
}],
"Branches": [{
"StartAt": "DynamicTimeout",
"States": {
"DynamicTimeout": {
"Type": "Wait",
"InputPath": "$.WaitBranchInput",
"SecondsPath": "$.timeout",
"Next": "TimeoutExpired"
},
"TimeoutExpired": {
"Type": "Fail",
"Cause": "TimeoutExceeded.",
"Error": "TimeoutExpired"
}
}
},
{
"StartAt": "WorkerState",
"States": {
"WorkerState": {
"Type": "Wait",
"InputPath": "$.WorkerBranchInput",
"SecondsPath": "$.SecondsPath",
"Next": "WorkerSuccessful"
},
"WorkerSuccessful": {
"Type": "Fail",
"Cause": "Throw Worker Success Exception",
"Error": "WorkerSuccess"
}
}
}
]
},
"ExecuteIfTimedOut": {
"Type": "Pass",
"End": true
},
"ExecuteIfWorkerSuccesfull": {
"Type": "Pass",
"End": true
}
}
}