Diagnosing Errors in Dataproc Create Cluster operation (Java library)


When I attempt to create a cluster using Google Dataproc, the create call initially appears to succeed, but a subsequent "get" for the cluster shows that it went immediately from the "Creating" state to the "Error" state. Unfortunately, invoking the Diagnostics call did not seem to help.

Here is what I am doing (some liberty has been taken to present the code with hard-coded strings instead of values obtained either via api or through configuration properties):

String projectId = "wide-isotope-147019";
String region = "us-central1-f";  // note: this value is a Compute Engine zone name
GceClusterConfig computeEngineConfig = new GceClusterConfig();
computeEngineConfig.setZoneUri(
    String.format(ZONE_URI_FORMAT, projectId, region));
List<String> tagList = new ArrayList<>();
tagList.add("ClusterName: mrfoo");
computeEngineConfig.setTags(tagList);

String machineType = String.format(MACHINE_TYPE_URI_FORMAT,
    projectId, region, "n1-standard-1");
InstanceGroupConfig masterConfig = new InstanceGroupConfig();
masterConfig.setMachineTypeUri(machineType)
            .setNumInstances(1);
InstanceGroupConfig workerConfig = new InstanceGroupConfig();
workerConfig.setMachineTypeUri(machineType)
            .setNumInstances(1);
ClusterConfig clusterConfig = new ClusterConfig();
clusterConfig.setGceClusterConfig(computeEngineConfig);
clusterConfig.setMasterConfig(masterConfig);
clusterConfig.setWorkerConfig(workerConfig);
List<NodeInitializationAction> installActions = new ArrayList<>();
// no init actions yet. want to get basics working first.
clusterConfig.setInitializationActions(installActions);
Cluster cluster = new Cluster();
cluster.setProjectId(projectId);
cluster.setConfig(clusterConfig);
cluster.setClusterName("mrfoo");

Dataproc.Projects.Regions.Clusters.Create createOp = null;
Operation result = null;
try {
    createOp = dataproc.projects().regions().clusters()
                       .create(projectId, "global", cluster);
    createOp.setBearerToken(...);
} catch (IOException ex) {
  // handle ...
}

try {
    result = createOp.execute();
} catch (IOException ex) {
   // handle.
}

return result;

The above generates a "reasonable" result without error. However, later, when I do a get operation:

Dataproc.Projects.Regions.Clusters.Get getOp = null;
Cluster result = null;
try {
    getOp = dataproc.projects().regions().clusters()
           .get("wide-isotope-147019", "global", "mrfoo");
    getOp.setBearerToken(...);
} catch (IOException ioEx) {
  ...
}
try {
   result = getOp.execute();
} catch (IOException ioEx) {
    ...
}

The call does not generate errors, but the returned Cluster has the state shown below. (Sorry for the long dump. See the very end, where the history shows CREATING but the current status is ERROR.)

{"clusterName":"mrfoo","clusterUuid":"<id string>","config":
    {"configBucket":"dataproc-<idstring>",
     "gceClusterConfig":"projectId":"wide-isotope-147019",
    <lots of stuff deleted>
  "status":{"state":"ERROR",
            "stateStartTime":"2016-12-13T00:27:11.143Z"},
   "statusHistory":[
      {"state":"CREATING",
       "stateStartTime":"2016-12-13T00:27:09.947Z"}]}

Solution

  • The general pattern for creating a Dataproc cluster is:

    Operation op = createCluster(...);
    while(!op.getDone()) {
        sleep(10);
        op = getOperation(op.getName());
    }
    
    if (op.hasError()) {
       // Display op.getError(); 
    }
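
The pattern above can be sketched in plain Java without any client library, which makes the loop easy to test. Everything named here is a hypothetical stand-in: `OperationStatus` models only the name, done flag, and error of the API's Operation, and the `fetchByName` function takes the place of whatever call re-reads the operation (presumably the operations "get" request in the real client). The bounded poll count is an addition of this sketch, not part of the original pattern.

```java
import java.util.function.Function;

/** Hypothetical stand-in for the API's Operation: only the fields the loop needs. */
final class OperationStatus {
    final String name;
    final boolean done;
    final String error; // null while running or when the operation succeeded

    OperationStatus(String name, boolean done, String error) {
        this.name = name;
        this.done = done;
        this.error = error;
    }
}

final class OperationPoller {
    /**
     * Re-fetches the operation until it reports done, sleeping between polls.
     * Gives up with an IllegalStateException after maxPolls fetches.
     */
    static OperationStatus waitUntilDone(OperationStatus initial,
                                         Function<String, OperationStatus> fetchByName,
                                         long pollIntervalMillis,
                                         int maxPolls) {
        OperationStatus op = initial;
        int polls = 0;
        while (!op.done) {
            if (polls++ >= maxPolls) {
                throw new IllegalStateException(
                    "Operation " + op.name + " still running after " + maxPolls + " polls");
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("Interrupted while polling " + op.name, ex);
            }
            op = fetchByName.apply(op.name);
        }
        return op;
    }
}
```

A caller would check `error` on the returned value: a non-null error means the create failed, and its message is where a rejection such as an invalid instance tag would show up.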
    

    From the logs, I can say that the issue in this particular case is that Compute Engine is rejecting the instance tags being passed, because they do not match Compute Engine's regex for valid tags: '(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)'. The tag "ClusterName: mrfoo" contains uppercase letters, a colon, and a space, none of which the regex allows. I've filed a bug so that Dataproc will validate instance tags sooner and raise the error immediately when you attempt to create a cluster, instead of setting the error on the operation.
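
Until that validation happens on the server side, a client-side check before calling create can catch the problem early. The following is a minimal sketch that uses the regex quoted above; the class and method names (`InstanceTagValidator`, `isValid`, `invalidTags`) are made up for illustration and are not part of the Dataproc library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: checks tags against the regex quoted in the answer. */
final class InstanceTagValidator {
    // Lowercase letter first, then up to 61 letters/digits/hyphens,
    // ending in a letter or digit.
    private static final Pattern VALID_TAG =
        Pattern.compile("(?:[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?)");

    /** Returns true only if the whole tag matches the pattern. */
    static boolean isValid(String tag) {
        return tag != null && VALID_TAG.matcher(tag).matches();
    }

    /** Returns the tags that would be rejected, in their original order. */
    static List<String> invalidTags(List<String> tags) {
        List<String> rejected = new ArrayList<>();
        for (String tag : tags) {
            if (!isValid(tag)) {
                rejected.add(tag);
            }
        }
        return rejected;
    }
}
```

With the tag list from the question, `invalidTags` would return "ClusterName: mrfoo", while a tag such as "clustername-mrfoo" would pass the check.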