apache-spark, pyspark, google-cloud-dataproc

How does Dataproc work with Google Cloud Storage?


I am trying to understand how Google Dataproc works with GCS. I am using PySpark on Dataproc, and data is read from and written to GCS (a rough sketch of the read/write pattern is below, after the questions). But I am unable to figure out the best machine types for my use case. Questions:

1) Does Spark on Dataproc copy data to local disk? E.g., if I am processing 2 TB of data, is it OK to use a 4-node cluster with 200 GB HDD per node, or should I at least provide disks that can hold the input data?

2) If the local disk is not used at all, is it OK to use high-memory, low-disk instances?

3) If the local disk is used, which instance type is good for processing 2 TB of data with the minimum possible number of nodes? I mean, is it good to use SSDs?
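
For reference, this is roughly the read/write pattern the job follows (bucket names, paths, and the column used for filtering are placeholders, not my real setup):

```python
from pyspark.sql import SparkSession

# Dataproc clusters ship with the GCS connector, so gs:// paths can be
# read and written directly, without staging data in HDFS first.
spark = SparkSession.builder.appName("gcs-read-write").getOrCreate()

# Placeholder bucket/paths and column name, just to show the pattern.
df = spark.read.parquet("gs://my-input-bucket/events/")
result = df.filter(df.status == "OK")
result.write.parquet("gs://my-output-bucket/events-ok/", mode="overwrite")
```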

Thanks

Manish


Solution

  • Spark will read data directly into memory and/or disk depending on whether you use RDDs or DataFrames. You should have at least enough disk to hold all of the data. If you are performing joins, the amount of disk necessary grows to handle shuffle spill (see the sketches after this answer).

    This equation changes if you discard a significant amount of data through filtering.

    Whether you use pd-standard, pd-ssd, or local-ssd comes down to cost and whether your application is CPU-bound or IO-bound.

    Disk IOPS is proportional to disk size, so very small disks are inadvisable. Keep in mind that disk (relative to CPU) is cheap.

    Same advice goes for network IO: more CPUs = more bandwidth.

    Finally, the default Dataproc settings are a reasonable starting point for experimenting and tuning.

    Source: https://cloud.google.com/compute/docs/disks/performance
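
As a minimal sketch of the memory/disk and shuffle-spill point above, assuming hypothetical gs:// paths and a user_id join key (neither is from the question):

```python
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("spill-sketch").getOrCreate()

# Hypothetical inputs, only to illustrate where local disk comes into play.
events = spark.read.parquet("gs://my-bucket/events/")
users = spark.read.parquet("gs://my-bucket/users/")

# Cached partitions that do not fit in memory are spilled to the workers' local disks.
events.persist(StorageLevel.MEMORY_AND_DISK)

# A join forces a shuffle, and shuffle files are written to local disk,
# which is why disk has to cover the input plus shuffle spill.
joined = events.join(users, on="user_id")
joined.write.parquet("gs://my-bucket/joined/", mode="overwrite")
```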
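
And a rough back-of-the-envelope check for the 2 TB / 4-node example in the question; the spill multiplier is an assumed safety factor, not a fixed rule:

```python
# Rough disk-sizing sketch for the 2 TB example above.
input_gb = 2000            # total input data
workers = 4                # worker nodes
spill_factor = 2.0         # assumed headroom for shuffle spill and temp files

needed_per_worker_gb = input_gb * spill_factor / workers
print(f"~{needed_per_worker_gb:.0f} GB of disk per worker")   # ~1000 GB

proposed_gb = 200          # the 200 GB HDD per node from the question
print("enough disk" if proposed_gb >= needed_per_worker_gb else "likely too small")
```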