apache-spark, google-cloud-platform, google-cloud-dataproc

Role of additional disk on top of default VM sizing


When we create a Dataproc cluster, we have the option to add additional disks for each VM under Configure nodes, i.e. 1) Primary disk size/type and 2) Number of local SSDs.

For example, the n2-standard-4 VM has 4 cores, 16 GB RAM, and a 10 GB standard (non-SSD) disk by default (reference: https://www.instance-pricing.com/provider=gcp/instance=n2-standard-4).

Question: In Apache Spark, when the data on a worker does not fit in RAM, it spills to disk. As per my understanding, each Dataproc VM has a default allocated disk space where the spill happens. I am trying to understand why we need a primary disk and local SSDs in addition to the default disk. Does the shuffle happen on the primary disk and local SSDs when they are attached?
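
As a rough way to check where spill and shuffle data land on a given worker, here is a minimal sketch (assuming Spark runs on YARN, which is Dataproc's default, and that the Hadoop config sits at the standard Dataproc path /etc/hadoop/conf/yarn-site.xml) that prints the YARN NodeManager local directories:

```python
# Minimal sketch: run on a Dataproc worker node to print the YARN NodeManager
# local directories. With Spark on YARN, executors write shuffle and spill
# files under these directories, so their mount points reveal whether the
# boot disk or the local SSDs are backing them.
# Assumption: the Hadoop config lives at the standard Dataproc path below.
import xml.etree.ElementTree as ET

YARN_SITE = "/etc/hadoop/conf/yarn-site.xml"

root = ET.parse(YARN_SITE).getroot()
for prop in root.iter("property"):
    if prop.findtext("name") == "yarn.nodemanager.local-dirs":
        print("yarn.nodemanager.local-dirs =", prop.findtext("value"))
```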


Solution

  • AFAIK, there is no such thing as a 10 GB default disk for Dataproc VMs; the 10 GB in the source you are referencing might just be an example.

    In Dataproc clusters, there are mandatory boot disks (which can be pd-standard, pd-balanced, or pd-ssd [1]) and optional local SSDs [2].

    When local SSDs are configured for the cluster, both HDFS and scratch data, such as shuffle outputs, use the local SSDs instead of the boot persistent disks [2]. When only standard PDs are configured, the disk size should be at least 1 TB per worker [3]. A minimal cluster-creation sketch showing these disk options is shown below.
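
For illustration, here is a minimal sketch of creating a cluster with an explicit boot (primary) disk and local SSDs using the google-cloud-dataproc Python client; the project ID, region, cluster name, and machine sizes are placeholder assumptions to adapt.

```python
# Minimal sketch (not a production setup): create a Dataproc cluster whose
# workers get a pd-standard boot disk plus two local SSDs.
# The project, region, and cluster name below are placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"         # placeholder
region = "us-central1"            # placeholder
cluster_name = "example-cluster"  # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n2-standard-4",
            "disk_config": {"boot_disk_type": "pd-balanced", "boot_disk_size_gb": 500},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n2-standard-4",
            "disk_config": {
                "boot_disk_type": "pd-standard",
                "boot_disk_size_gb": 1000,  # >= 1 TB recommended when relying on pd-standard only
                "num_local_ssds": 2,        # shuffle/HDFS move to these when > 0
            },
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)
```

With num_local_ssds set above zero, Dataproc mounts the SSDs and points HDFS data directories and YARN/Spark scratch space at them, so shuffle spills land on the SSDs rather than on the boot persistent disk.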