aws-sdk amazon-emr

Get status of 'newly-launched' EMR cluster programmatically


I'm following the official docs guide to write a Scala script that launches an EMR cluster using the AWS Java SDK. I can identify 3 major steps needed here:

  1. Instantiating an EMR Client

    I do this using AmazonElasticMapReduceClientBuilder.defaultClient()

  2. Creating a JobFlowRequest

    I create a RunJobFlowRequest object and supply it with a JobFlowInstancesConfig (both objects are given appropriate parameters, depending on the requirement)

  3. Running the JobFlowRequest

    This is done by calling emrClient.runJobFlow(runJobFlowRequest), which returns a RunJobFlowResult object
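For reference, a minimal sketch of these three steps, assuming the AWS SDK for Java v1 (the aws-java-sdk-emr artifact) is on the classpath and default credentials and region are configured. The object name, cluster name, release label, instance types and role names are placeholder assumptions, not values from my actual setup:

```scala
import com.amazonaws.services.elasticmapreduce.{AmazonElasticMapReduce, AmazonElasticMapReduceClientBuilder}
import com.amazonaws.services.elasticmapreduce.model.{JobFlowInstancesConfig, RunJobFlowRequest, RunJobFlowResult}

object LaunchEmrCluster {

  // Step 2: build the request. Every literal below is an illustrative placeholder.
  def buildRequest(): RunJobFlowRequest = {
    val instances = new JobFlowInstancesConfig()
      .withInstanceCount(Int.box(3))
      .withMasterInstanceType("m5.xlarge")
      .withSlaveInstanceType("m5.xlarge")
      .withKeepJobFlowAliveWhenNoSteps(Boolean.box(true))

    new RunJobFlowRequest()
      .withName("example-cluster")
      .withReleaseLabel("emr-5.30.0")
      .withServiceRole("EMR_DefaultRole")
      .withJobFlowRole("EMR_EC2_DefaultRole")
      .withInstances(instances)
  }

  def main(args: Array[String]): Unit = {
    // Step 1: instantiate the client using the default credential/region chain.
    val emrClient: AmazonElasticMapReduce = AmazonElasticMapReduceClientBuilder.defaultClient()

    // Step 3: submit the request; the call returns as soon as EMR accepts it.
    val result: RunJobFlowResult = emrClient.runJobFlow(buildRequest())
    println(s"runJobFlow returned: $result")
  }
}
```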

But the RunJobFlowResult object doesn't give any clue as to whether the cluster was launched successfully (with all the given configurations).


Now I'm aware that the listClusters() method of the emrClient can be used to get the cluster id of the newly-launched cluster, and that the id can then be used to query the state of the cluster with a describeCluster() call. However, since I'm using a Scala script to perform all this stuff, I need the process to be automated (here, looking up the cluster id in the result of getClusters() would have to be done manually).

Is there any way this could be achieved?


Solution

  • You have all the pieces there but haven't quite stitched them together.

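    The cluster's id can be retrieved from RunJobFlowResult.getJobFlowId(). (It is a string starting with "j-".) You can then pass this id to describeCluster() (through DescribeClusterRequest.withClusterId()) and read the cluster's state from the status of the returned cluster.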

    I don't blame you for your confusion, since it's called "jobFlowId" in some methods (mainly older API methods) and "clusterId" in others. They are really the same thing.
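Stitched together, that could look something like the sketch below. It assumes the AWS SDK for Java v1; the object name EmrClusterStatus, the helpers awaitReady, clusterState and launchAndWait, and the 60 attempts x 30 seconds polling budget are all hypothetical choices made for illustration. The state names are the ones EMR reports for a cluster's status.

```scala
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce
import com.amazonaws.services.elasticmapreduce.model.{DescribeClusterRequest, RunJobFlowRequest}
import scala.annotation.tailrec

object EmrClusterStatus {
  // Cluster states treated as "up" and as "failed or going away" in this sketch.
  val ReadyStates: Set[String]  = Set("RUNNING", "WAITING")
  val FailedStates: Set[String] = Set("TERMINATING", "TERMINATED", "TERMINATED_WITH_ERRORS")

  // Polls fetchState until the cluster is ready, has failed, or attempts run out.
  // Right(state) means ready; Left(state) carries the last state seen otherwise.
  @tailrec
  def awaitReady(fetchState: () => String, attemptsLeft: Int, pauseMillis: Long): Either[String, String] = {
    val state = fetchState()
    if (ReadyStates(state)) Right(state)
    else if (FailedStates(state) || attemptsLeft <= 1) Left(state)
    else {
      Thread.sleep(pauseMillis)
      awaitReady(fetchState, attemptsLeft - 1, pauseMillis)
    }
  }

  // Looks up the current state string (e.g. "STARTING") for a cluster id.
  def clusterState(emr: AmazonElasticMapReduce, clusterId: String): String =
    emr.describeCluster(new DescribeClusterRequest().withClusterId(clusterId))
      .getCluster.getStatus.getState

  // Launches the cluster, then reuses the returned job flow id as the cluster id.
  def launchAndWait(emr: AmazonElasticMapReduce, request: RunJobFlowRequest): Either[String, String] = {
    val clusterId = emr.runJobFlow(request).getJobFlowId // "j-..." string
    awaitReady(() => clusterState(emr, clusterId), attemptsLeft = 60, pauseMillis = 30000L)
  }
}
```

Keeping the polling loop separate from the SDK call means it can be exercised with a stubbed state function, without launching a real cluster.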