Tags: google-cloud-platform, google-kubernetes-engine, google-cloud-dataproc

How should master and worker nodes be configured for scalability and high availability?


I'm working on a data engineering solution using GCP Dataproc and Kubernetes.

Building a prototype is easy, but the question of how to size the master and worker nodes remains. The examples from the cloud provider use the same machine configuration for both masters and workers:

https://cloud.google.com/ai-platform/training/docs/machine-types

The same holds for other cloud providers such as AWS and Azure.

Is it possible to give the master a lower configuration than the workers? E.g. master = n1-highcpu-8, workers = n1-highcpu-16.


Solution

  • When you run Dataproc on GKE, the master and worker node sizes do not actually apply, because Kubernetes, rather than YARN, becomes the resource manager. When you create your GKE cluster, there are various strategies for optimizing cost and scale for running Dataproc. I'd recommend using node auto-provisioning, as it automatically adds and removes right-sized nodes depending on the workloads deployed; you set minimum and maximum resource limits as well (see the first sketch below). I believe the minimum should correspond to 4-CPU machine types.

    When creating standard Dataproc clusters, master and worker nodes can indeed be different machine types (see the second sketch below). Factors that help determine the right size for your master nodes include the number of worker nodes and the number of jobs submitted. Typically you end up with a similar CPU configuration for master and worker nodes; once you reach 500+ worker nodes, you'd likely want your master node(s) to have 2x the memory of your worker nodes, as they have a much larger worker footprint to manage.
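
    As a first sketch, here is what enabling node auto-provisioning with resource limits can look like using the `google-cloud-container` Python client. The project, location, cluster name, and the specific limit values are placeholder assumptions, not prescriptions:

    ```python
    from google.cloud import container_v1

    # Hypothetical project/location -- replace with your own.
    PROJECT_ID = "my-project"
    LOCATION = "us-central1-a"

    client = container_v1.ClusterManagerClient()

    # Enable node auto-provisioning with cluster-wide CPU/memory limits.
    # GKE then adds/removes right-sized node pools for the workloads deployed.
    cluster = container_v1.Cluster(
        name="dataproc-gke",
        initial_node_count=1,
        autoscaling=container_v1.ClusterAutoscaling(
            enable_node_autoprovisioning=True,
            resource_limits=[
                # Assumed cluster-wide bounds; per the advice above, you'd
                # aim for auto-provisioned nodes of at least 4-CPU types.
                container_v1.ResourceLimit(resource_type="cpu", minimum=4, maximum=64),
                container_v1.ResourceLimit(resource_type="memory", minimum=16, maximum=256),
            ],
        ),
    )

    operation = client.create_cluster(
        request={
            "parent": f"projects/{PROJECT_ID}/locations/{LOCATION}",
            "cluster": cluster,
        }
    )
    print("Cluster create operation:", operation.name)
    ```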
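
    And a second sketch of a standard Dataproc cluster where the master is smaller than the workers, using the `google-cloud-dataproc` client. The project, region, cluster name, and worker count are assumptions; the machine types are the ones from the question:

    ```python
    from google.cloud import dataproc_v1

    # Hypothetical project/region -- replace with your own.
    PROJECT_ID = "my-project"
    REGION = "us-central1"

    # Dataproc endpoints are regional; point the client at the right region.
    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    # Master and worker machine types can differ: here a smaller master
    # (n1-highcpu-8) manages larger workers (n1-highcpu-16).
    cluster = {
        "project_id": PROJECT_ID,
        "cluster_name": "data-eng-cluster",
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-highcpu-8"},
            "worker_config": {"num_instances": 16, "machine_type_uri": "n1-highcpu-16"},
        },
    }

    operation = client.create_cluster(
        request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
    )
    print("Created cluster:", operation.result().cluster_name)
    ```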