Tags: google-cloud-dataproc, service-accounts, terraform-provider-gcp

Terraform dataproc cluster setup issues


I am trying to spin up a private Dataproc cluster (1 master, 2 workers) in GCP via Terraform. It also requires optional components such as Docker, Anaconda, and Jupyter. Below are my concerns:

  1. I am trying to add image_version and optional_components under software_config as shown below. Is that doable?
    software_config {

      image_version = "1.4.21-debian9"
      override_properties = {
        "dataproc:dataproc.allow.zero.workers"          = "true"
      }
      optional_components = [ "DOCKER", "ANACONDA", "JUPYTER" ]
    }       
  2. If the above is not doable, is using initialization_action blocks my only option, like below?
    initialization_action {
      script      = "gs://dataproc-initialization-actions/conda/install-conda-env.sh"
      timeout_sec = 500
    }
  3. How do I assign permissions/keys to the nodes being spun up through Terraform, so that users can access the nodes once provisioned? I tried the following:
    gce_cluster_config {
      tags    = ["env", "test"]
      network = "${google_compute_network.dp-network.name}"
      internal_ip_only = true
      service_account = "[email protected]"
    }

Appreciate your inputs,

Thank you!

Update: I can spin up a cluster without optional_components specified in software_config. But if I do specify them, cluster creation fails with an error asking me to report it as a bug.

    gce_cluster_config {
      network                = "${google_compute_network.dataproc-network.name}"
      internal_ip_only       = true
      tags                   = ["env", "staging"]
      zone                   = "${var.zone}"
      service_account        = "${var.service_account}"
      service_account_scopes = [
        "https://www.googleapis.com/auth/monitoring",
        "useraccounts-ro",
        "storage-rw",
        "logging-write",
      ]
    }

    # We can define multiple initialization_action blocks    
    initialization_action {
      script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
      timeout_sec = 500
    }
    initialization_action {
      script      = "gs://dataproc-initialization-actions/jupyter/jupyter.sh"
      timeout_sec = 500 
    }

Solution

  • Either 1 or 2 should be fine. What is likely happening is that the Terraform provider for Dataproc is out of sync with the API, so please file the bug as suggested by the error.
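
    As a minimal sketch of option 1 (assuming a google provider version new enough to expose optional_components; the cluster name and region are placeholders, and the image version and component list are copied from the question):

      resource "google_dataproc_cluster" "example" {
        name   = "example-cluster"   # placeholder name
        region = "us-central1"       # placeholder region

        cluster_config {
          software_config {
            image_version = "1.4.21-debian9"

            override_properties = {
              "dataproc:dataproc.allow.zero.workers" = "true"
            }

            # Requires a provider release that supports optional_components;
            # older releases will reject this argument.
            optional_components = ["DOCKER", "ANACONDA", "JUPYTER"]
          }
        }
      }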

    For 3, there's a bit of confusion here, so let me try to clear it up. Users will have access to resources (clusters) when you grant them IAM bindings; this has nothing to do with how you create the cluster. Either the Editor role, the Dataproc Editor role, or a custom role will allow them to interact with clusters.
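
    For example, a hedged sketch of granting a user the Dataproc Editor role at the project level (the project ID and member address are placeholders):

      # Grant a user permission to interact with Dataproc clusters in the
      # project; this is independent of how the cluster itself is created.
      resource "google_project_iam_member" "dataproc_editor" {
        project = "my-gcp-project"             # placeholder project ID
        role    = "roles/dataproc.editor"
        member  = "user:someone@example.com"   # placeholder user
      }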

    It is a good move to set internal_ip_only, since that makes the cluster inaccessible from the public internet, but it also means gcloud compute ssh to individual nodes will not work.

    Finally, any user that has permission to interact with the cluster effectively has the same permissions as the cluster's service account. This article explains it: https://cloud.google.com/dataproc/docs/concepts/iam/dataproc-principals
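
    If the cluster runs as a custom service account (as in the question's gce_cluster_config), that account itself also needs Dataproc worker permissions; a hedged sketch, with the project ID and service-account email as placeholders:

      # The VM service account the nodes run as needs the Dataproc Worker
      # role; users who can interact with the cluster effectively act with
      # this account's permissions.
      resource "google_project_iam_member" "dataproc_worker" {
        project = "my-gcp-project"   # placeholder project ID
        role    = "roles/dataproc.worker"
        member  = "serviceAccount:dataproc-sa@my-gcp-project.iam.gserviceaccount.com"   # placeholder
      }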