I am trying to spin up a private Dataproc cluster (1 master, 2 workers) in GCP via Terraform. It should also include the optional components Docker, Anaconda, and Jupyter. Below is my configuration, along with my concerns:
```hcl
software_config {
  image_version = "1.4.21-debian9"
  override_properties = {
    "dataproc:dataproc.allow.zero.workers" = "true"
  }
  optional_components = ["DOCKER", "ANACONDA", "JUPYTER"]
}

initialization_action {
  script      = "gs://dataproc-initialization-actions/conda/install-conda-env.sh"
  timeout_sec = 500
}

gce_cluster_config {
  tags             = ["env", "test"]
  network          = "${google_compute_network.dp-network.name}"
  internal_ip_only = true
  service_account  = "[email protected]"
}
```
Appreciate your inputs,
Thank you!
Update: I can spin up a cluster without `optional_components` specified in the `software_config`. But if I do specify them, cluster creation fails with an error asking me to report a bug.
```hcl
gce_cluster_config {
  network          = "${google_compute_network.dataproc-network.name}"
  internal_ip_only = true
  tags             = ["env", "staging"]
  zone             = "${var.zone}"
  service_account  = "${var.service_account}"
  service_account_scopes = [
    "https://www.googleapis.com/auth/monitoring",
    "useraccounts-ro",
    "storage-rw",
    "logging-write",
  ]
}

# We can define multiple initialization_action blocks
initialization_action {
  script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
  timeout_sec = 500
}

initialization_action {
  script      = "gs://dataproc-initialization-actions/jupyter/jupyter.sh"
  timeout_sec = 500
}
```
Either 1 or 2 should be fine. What is likely happening is that the Terraform provider for Dataproc is out of sync with the API, so please file the bug as suggested by the error.
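In the meantime, it may be worth checking whether a newer release of the Google provider already supports `optional_components`. A minimal sketch of pinning the provider version (the version number and project ID here are placeholders, not a known fix):

```hcl
# Hypothetical sketch: pin a recent google provider release, since support
# for newer Dataproc fields like optional_components lands in provider updates.
provider "google" {
  version = "~> 2.15"        # assumption: pick the latest release available
  project = "my-project-id"  # placeholder: your project ID
  region  = "us-central1"    # placeholder: your region
}
```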
For 3, there's a bit of confusion here - let me try to clear this up. Users will have access to resources (clusters) when you grant them IAM bindings; this has nothing to do with how you create the cluster. Either the `Editor` role, the `Dataproc Editor` role, or a custom role will allow them to interact with clusters.
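Since you are already using Terraform, one way to grant such a binding is with a `google_project_iam_member` resource. A minimal sketch (the project ID and user email are placeholders):

```hcl
# Hypothetical sketch: grant a user the Dataproc Editor role at the project
# level so they can interact with clusters regardless of who created them.
resource "google_project_iam_member" "dataproc_editor" {
  project = "my-project-id"             # placeholder: your project ID
  role    = "roles/dataproc.editor"
  member  = "user:[email protected]"  # placeholder: the user's email
}
```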
It is a good move to set `internal_ip_only = true`, since that makes the cluster inaccessible from the public internet, but it also means `gcloud compute ssh` to individual nodes will not work directly.
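If you still need shell access to nodes on an internal-IP-only cluster, one option (assuming Identity-Aware Proxy is enabled and you have the needed IAM permissions) is to tunnel the SSH connection through IAP. The cluster and zone names here are placeholders:

```shell
# Hypothetical example: SSH to the master node of an internal-IP-only
# cluster by tunneling through Identity-Aware Proxy (IAP).
gcloud compute ssh my-cluster-m \
    --zone=us-central1-a \
    --tunnel-through-iap
```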
Finally, any user with permission to interact with the cluster effectively has the same permissions as the cluster's service account. This article explains the concept: https://cloud.google.com/dataproc/docs/concepts/iam/dataproc-principals