I am specifying a NodeInitializationAction for Dataproc as follows:
ClusterConfig clusterConfig = new ClusterConfig();
clusterConfig.setGceClusterConfig(...);
clusterConfig.setMasterConfig(...);
clusterConfig.setWorkerConfig(...);
List<NodeInitializationAction> initActions = new ArrayList<>();
NodeInitializationAction action = new NodeInitializationAction();
action.setExecutableFile("gs://mybucket/myExecutableFile");
initActions.add(action);
clusterConfig.setInitializationActions(initActions);
Then later:
Cluster cluster = new Cluster();
cluster.setProjectId("wide-isotope-147019");
cluster.setConfig(clusterConfig);
cluster.setClusterName("cat");
Then finally, I invoke the dataproc.create operation with the cluster. I can see the cluster being created, but when I ssh into the master machine ("cat-m" in us-central1-f), I see no evidence of the script I specified having been copied over or run.
So this leads to my questions:
Thanks in advance.
Dataproc makes a number of guarantees about init actions:
Each script should be downloaded and stored locally in:
/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0
The output of the script will be captured in a "staging bucket" (either the bucket specified via the --bucket option, or a Dataproc auto-generated bucket). Assuming your cluster is named my-cluster, if you describe the master instance via gcloud compute instances describe my-cluster-m, the exact location is in the dataproc-agent-output-directory metadata key.
The cluster will not enter the RUNNING state (and the Operation will not complete) until all init actions have executed on all nodes. If an init action exits with a non-zero code, or exceeds the specified timeout, it will be reported as such.
Similarly, if you resize a cluster, we guarantee that new workers do not join the cluster until each worker is fully configured in isolation.
If you still don't believe me :) inspect the Dataproc agent log in /var/log/google-dataproc-agent-0.log and look for entries from BootstrapActionRunner.
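The two local checks above (the downloaded script and the agent log) could be automated with something like the following sketch, meant to be run on the master node itself. The class and method names are hypothetical. The file-name prefix is inferred from the single path quoted above, so the assumption that further scripts follow the same naming is mine. Since the exact log line format is not known here, the code only matches the substring BootstrapActionRunner.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of any Dataproc library.
class InitActionCheck {
    // Both locations are the ones quoted in the answer above.
    static final Path SCRIPT_DIR = Paths.get("/etc/google-dataproc/startup-scripts");
    static final Path AGENT_LOG = Paths.get("/var/log/google-dataproc-agent-0.log");

    // Assumption: every downloaded init script shares this file-name prefix.
    static final String SCRIPT_PREFIX = "dataproc-initialization-script-";

    /** Returns the sorted names of init scripts found in dir, or an empty list if dir is missing. */
    static List<String> findInitScripts(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return Collections.emptyList();
        }
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                        .filter(name -> name.startsWith(SCRIPT_PREFIX))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    /** Returns the log lines that mention BootstrapActionRunner, or an empty list if the log is unreadable. */
    static List<String> findRunnerEntries(Path log) throws IOException {
        if (!Files.isRegularFile(log) || !Files.isReadable(log)) {
            return Collections.emptyList();
        }
        // ISO-8859-1 maps every byte to a character, so stray bytes in the log cannot abort the scan.
        try (Stream<String> lines = Files.lines(log, StandardCharsets.ISO_8859_1)) {
            return lines.filter(line -> line.contains("BootstrapActionRunner"))
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Init scripts found: " + findInitScripts(SCRIPT_DIR));
        for (String entry : findRunnerEntries(AGENT_LOG)) {
            System.out.println(entry);
        }
    }
}
```

If both lists come back empty on the master, that would suggest the initialization actions never reached the cluster configuration in the create request, which is worth checking before debugging the script itself.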