I've used Google Cloud Platform's Cloud Data Fusion product in both the Developer and Enterprise editions.
In the Developer edition, there were no Dataproc settings (master node, worker node).
In the Enterprise edition, there were Dataproc settings (master node, worker node).
What I'm curious about is the Enterprise edition, where I was able to set values for the master node and worker node.
In detail:
Enterprise
- Dataproc
  - Master
    - Number of masters: 1
    - Master cores: 2 vCPU
    - Master memory: 4 GB
    - Master disk size: 1 TB
  - Worker
    - Number of workers: 2
    - Worker cores: 4 vCPU
    - Worker memory: 16 GB
    - Worker disk size: 1.5 TB
- VM
  - Driver
    - CPU: 2
    - Memory: 4 GB (= 4096 MB)
  - Executor
    - CPU: 2
    - Memory: 8 GB (= 8192 MB)
Those are the settings I used.
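To summarize what that profile adds up to (just my own arithmetic over the values above, not Data Fusion terminology):

```python
# Dataproc side of the Enterprise profile above
num_masters = 1      # each master: 2 vCPU, 4 GB RAM, 1 TB disk
num_workers = 2      # each worker: 4 vCPU, 16 GB RAM, 1.5 TB disk

total_vms = num_masters + num_workers   # 3 VMs per pipeline run
worker_vcpus = num_workers * 4          # 8 vCPUs across the workers
worker_memory_gb = num_workers * 16     # 32 GB RAM across the workers

print(total_vms, worker_vcpus, worker_memory_gb)  # 3 8 32
```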
When I ran the data pipeline, I could see that these VMs were created.
I'm very curious about the relationship between the VM settings (driver, executor) and Dataproc's worker nodes.
In short, Data Fusion exposes settings for Dataproc, and when I create a data pipeline it runs on VM instances provisioned according to those Dataproc settings. I want to know the relationship between the VM instance settings (driver, executor) and the Dataproc settings.
The Dataproc settings let you define the cluster that gets created, whereas the driver and executor settings in Cloud Data Fusion let you adjust how much of that cluster's resources a pipeline run will use.
As such, creating a Dataproc cluster with 3 workers and 1 master will create 4 VMs with the memory and CPUs specified in the Dataproc configuration, whereas setting the driver/executor CPUs and memory dictates how much of each master/worker VM's CPU and memory a data pipeline job running on that cluster will use.
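To make that concrete with the numbers from the question, here is a rough back-of-the-envelope sketch (plain Python arithmetic; it ignores YARN and OS overhead, so the real usable capacity of each worker is somewhat lower):

```python
# Worker VM size from the Dataproc settings in the question
worker_vcpus = 4
worker_memory_gb = 16
num_workers = 2

# Per-container resources from the pipeline's VM settings
executor_vcpus = 2
executor_memory_gb = 8   # 8192 MB
driver_vcpus = 2
driver_memory_gb = 4     # 4096 MB

# Roughly how many executors fit on a single worker VM:
# whichever of CPU or memory runs out first sets the limit.
executors_per_worker = min(worker_vcpus // executor_vcpus,
                           worker_memory_gb // executor_memory_gb)

# Rough upper bound for the whole cluster (ignoring overhead and the
# capacity the driver itself consumes).
max_concurrent_executors = executors_per_worker * num_workers

print(executors_per_worker)       # 2 per 4 vCPU / 16 GB worker
print(max_concurrent_executors)   # 4 across the 2 workers
```

So the Dataproc values decide how big the VMs are, and the driver/executor values decide how the pipeline's containers are packed onto (and therefore how much they consume of) those VMs.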