Search code examples
python-3.xamazon-web-servicesaws-lambdaamazon-emraws-step-functions

How to terminate an EMR Cluster from Lambda when all Steps are COMPLETE?


I am trying to figure out how I can terminate an EMR cluster successfully once all the steps submitted to it are 'COMPLETED'|'CANCELLED'|'FAILED'|'INTERRUPTED'. There are three Lambda functions.

  • Lambda 1: Does some work and creates EMR. Triggers Lambda 2 by passing steps and cluster ID through event.
  • Lambda 2: Submits the steps received from Lambda 1 to the cluster ID received from the same.
  • Lambda 3: Submits a final step and then should send a request for termination when all steps are 'COMPLETED'|'CANCELLED'|'FAILED'|'INTERRUPTED'.

I've done till Lambda 3's step submission, but unable to do the rest.

I have successfully created EMR through:

conn = boto3.client("emr")
cluster_id = conn.run_job_flow()

submitted steps through:

conn = boto3.client("emr")
action = conn.add_job_flow_steps(JobFlowId=cluster_id, Steps=event["steps"])

Now how can this termination be triggered only on the given condition? I saw the boto3 API doc which has client.terminate_job_flows(), but this function doesn't wait for the steps to finish or fail and directly hits the termination process.

Is there a way to change KeepJobFlowAliveWhenNoSteps from TRUE to FALSE when all my steps are done? Then I think it should automatically turn off the cluster. But going by the API docs, didn't find any option to change this parameter once the run_job_flow() is called.

Hope I was able to convey the issue I faced correctly. Any help?

Note: Using Python 3.8 in AWS Lambda. Each steps are Spark jobs.


Solution

  • I agree with your research. The optimal situation would be to set KeepJobFlowAliveWhenNoSteps to FALSE to have the cluster self-terminate.

    I do notice that the RunJobFlow documentation says:

    If the KeepJobFlowAliveWhenNoSteps parameter is set to TRUE, the cluster transitions to the WAITING state rather than shutting down after the steps have completed.

    Therefore, the Lambda function could check whether the cluster is in the WAITING state and, if so, shutdown the cluster. However, this would take repeated checking.

    It might be possible to submit a final step that calls the EMR API to shutdown the cluster. This means that the cluster is effectively calling for its own termination as a final step. (I haven't tried this concept, but it would be a clean way of performing the shutdown without having to repeatedly check the status.)

    There is also a similar discussion about shutting down idle clusters on this Question: How to terminate AWS EMR Cluster automatically after some time