
Get EMR Cluster ID inside Java Spark application


I have a Spark application written in Java that runs on AWS EMR. I want to get the ID of the EMR cluster from inside my Java code. I tried the following:

String emrClusterID = System.getenv("EMR_CLUSTER_ID");

but it returns null. I do not want to call the EMR APIs to list the running clusters and pick the ID from there, because the code does not know the cluster name, and several clusters with the same name can be in the Running state at once. So how can I get the ID of the cluster the application is running on from inside the Spark Java application?


Solution

  • You can read and parse the JSON file /mnt/var/lib/info/job-flow.json on the EMR server's local filesystem.

    The attribute jobFlowId is the clusterId.

    A basic implementation (missing error handling) could be something like this:

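    import java.io.File;
    import java.io.IOException;
    import java.util.Map;
    import com.fasterxml.jackson.core.type.TypeReference;
    import com.fasterxml.jackson.databind.ObjectMapper;
    
    public class EmrInfo {
    
      // Cluster metadata file on the EMR node's local filesystem.
      static final File EMR_JOB_FLOW = new File("/mnt/var/lib/info/job-flow.json");
    
      // Returns the jobFlowId (the cluster ID), or "UNKNOWN_ID" if the attribute is absent.
      // readValue throws a checked IOException, so it is declared here rather than handled.
      public static String getEmrId() throws IOException {
          ObjectMapper mapper = new ObjectMapper();
          // A typed map is needed: getOrDefault with a String default does not compile on Map<?, ?>.
          Map<String, Object> map =
              mapper.readValue(EMR_JOB_FLOW, new TypeReference<Map<String, Object>>() {});
          return String.valueOf(map.getOrDefault("jobFlowId", "UNKNOWN_ID"));
      }
    }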
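
If adding Jackson is not an option, or the code may also run where the file does not exist (a local test run, for example), a dependency-free variant with basic error handling could look like the sketch below. It assumes the file holds a top-level string attribute jobFlowId, as described above. The class name EmrClusterIdReader and the method read are hypothetical names chosen for illustration, and the regular expression is a shortcut standing in for a real JSON parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts jobFlowId from a job-flow.json file without a JSON library.
final class EmrClusterIdReader {

    // Matches "jobFlowId": "<value>" with optional whitespace around the colon.
    // Assumption: the value is a plain string without escaped quotes.
    private static final Pattern JOB_FLOW_ID =
            Pattern.compile("\"jobFlowId\"\\s*:\\s*\"([^\"]+)\"");

    private EmrClusterIdReader() {}

    // Returns the cluster ID found in the file, or empty if the file
    // cannot be read or does not contain the attribute.
    static Optional<String> read(Path jobFlowFile) {
        try {
            String json = new String(Files.readAllBytes(jobFlowFile), StandardCharsets.UTF_8);
            Matcher m = JOB_FLOW_ID.matcher(json);
            return m.find() ? Optional.of(m.group(1)) : Optional.empty();
        } catch (IOException e) {
            // File missing or unreadable, e.g. when not running on an EMR node.
            return Optional.empty();
        }
    }
}
```

A call such as EmrClusterIdReader.read(Paths.get("/mnt/var/lib/info/job-flow.json")).orElse("UNKNOWN_ID") would then mirror the behaviour of getEmrId() above. Taking the path as a parameter keeps the method testable against a temporary file.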