google-cloud-platform, google-cloud-dataproc

Google Cloud new cluster generation failure.


I have been trying to create a new cluster using both the Web UI and the following command:

    gcloud dataproc clusters create cluster-2 --zone europe-west1-b --master-machine-type n1-standard-1 --master-boot-disk-size 50 --num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 50 --project <project-name>

The cluster is small: one master node and two worker nodes. The virtual machines are created and run properly, but cluster creation itself fails.

The error messages shown during cluster creation point me to the file "dataproc-startup-script_output". The only error message I have found in that file is:

    Error: "--max_wait_seconds" does not look like a port

I have 5 VMs in total. Single machines can be created and run successfully. A few days ago I was able to create a cluster with no problems, but that cluster has since been deleted. Is there a limit to how many clusters one can create?


Solution

  • To summarize the findings from a separate follow-up email thread: in general, if

    1. It takes longer than 10 minutes or so to fail, and
    2. You've changed project network settings at all

    then a potential culprit is network misconfiguration. In general, the nodes of a Dataproc cluster require full internal-IP network access to each other. It is typical to have a firewall rule in your Google Compute Engine network that opens all of udp:1-65535,tcp:1-65535,icmp, with the "source IP range" limited to the internal IP range.

    In this case, the project was indeed missing a necessary rule because of a minor typo: the default-allow-internal rule had its source IP range restricted with a full /32 mask. A /32 range matches a single address only, so the cluster nodes could not reach each other.
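The check described above can be sketched from the command line. This is a minimal sketch, not the method used in the email thread: the function names `rule_source_ranges` and `has_single_host_range` are made up for illustration, `my-project` is a placeholder, and it assumes `gcloud compute firewall-rules describe` returns the rule's `sourceRanges` field as a comma-separated list with the format expression shown.

```shell
# Print the source ranges of a firewall rule as a comma-separated list.
#   $1 project ID, $2 firewall rule name
rule_source_ranges() {
  gcloud compute firewall-rules describe "$2" \
    --project "$1" \
    --format="value(sourceRanges.list())"
}

# Succeed (exit status 0) if any range in a comma- or semicolon-separated
# list uses a /32 mask, i.e. matches a single address only.
has_single_host_range() {
  for range in $(printf '%s' "$1" | tr ',;' '  '); do
    case "$range" in
      */32) return 0 ;;
    esac
  done
  return 1
}

# Example call (not run here; "my-project" is a placeholder):
# has_single_host_range "$(rule_source_ranges my-project default-allow-internal)" \
#   && echo "default-allow-internal only admits a single source address"
```

If the check succeeds for your internal rule, its source range is probably too narrow for cluster nodes to talk to each other.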

    If you're not doing advanced VPN configuration or cross-project networking, re-adding a simple firewall rule that allows udp:1-65535;tcp:1-65535;icmp from 10.0.0.0/8 should work. If you're doing more advanced networking, you'll likely want to restrict the source range further. For example, if your network's IPv4 range is 10.128.0.0/16, set your "allow internal" firewall rule to use 10.128.0.0/16 as its source range as well.
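Re-adding such a rule with gcloud might look like the sketch below. The function name `create_allow_internal_rule`, the rule name `allow-internal`, the project `my-project`, and the network `default` are illustrative assumptions; substitute your own values.

```shell
# Create an "allow internal" firewall rule that opens all TCP and UDP ports
# plus ICMP to traffic coming from the given internal source range.
#   $1 project ID, $2 network name, $3 source range in CIDR notation
create_allow_internal_rule() {
  gcloud compute firewall-rules create allow-internal \
    --project "$1" \
    --network "$2" \
    --allow tcp:1-65535,udp:1-65535,icmp \
    --source-ranges "$3"
}

# Example call (not run here; all three values are placeholders):
# create_allow_internal_rule my-project default 10.0.0.0/8
```

For a more restrictive setup, pass your network's own range (for example 10.128.0.0/16) as the third argument instead of 10.0.0.0/8.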

    When you add firewall rules through the Cloud Console, it offers helpers for selecting applicable source IP ranges, which is especially convenient when you have subnetworks that are messy to enumerate manually.