I have a question about the DataFusion Data Pipeline.
I'm using the Enterprise edition of DataFusion.
When I create a data pipeline in the DataFusion Studio, I can set the CPU and memory values of the executor and driver directly in the config.
Until now, I thought that creating a data pipeline would create one VM instance per pipeline.
However, I just saw that multiple VMs are created, one for each worker node and master node.
So what do the CPU and memory settings of the executor and driver mean when creating the data pipeline?
For a Spark pipeline run, Data Fusion starts one driver with multiple executors, usually (though not always) corresponding to the number of worker nodes. Typically, each worker node runs one executor. The CPU and memory settings of the driver and executors therefore set an upper bound on the CPUs and memory that the driver and each executor can use for the run.
In practice, this upper bound may not be reached if, for example, you set the memory or CPUs for an executor higher than what is available on a worker node.
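Under the hood the pipeline runs as a Spark job on a Dataproc cluster, so these Studio settings roughly map to Spark's spark.driver.cores / spark.driver.memory and spark.executor.cores / spark.executor.memory properties. As a rough sketch of the "upper bound" idea (the node sizes and executor settings below are made-up numbers for illustration, not defaults), the executor values limit how many executors can actually be packed onto each worker node:

```python
# Illustrative only: how executor CPU/memory settings bound what fits
# on a worker node. All numbers are assumptions, not Data Fusion defaults.

worker_vcpus = 4          # vCPUs on each Dataproc worker node (assumed)
worker_memory_gb = 15     # usable memory per worker node in GB (assumed)

executor_vcpus = 2        # "Executor CPU" set in the Studio config
executor_memory_gb = 8    # "Executor Memory" set in the Studio config

# A worker can host as many executors as BOTH resources allow.
executors_per_worker = min(worker_vcpus // executor_vcpus,
                           worker_memory_gb // executor_memory_gb)
print(f"Executors per worker: {executors_per_worker}")   # -> 1

# If the request exceeds the node's capacity, the executor simply cannot
# get that much, so the configured value acts only as an upper bound.
oversized_memory_gb = 32
print(oversized_memory_gb <= worker_memory_gb)            # -> False
```

In other words, the driver/executor CPU and memory values describe the resources requested for the Spark processes that run on the cluster, not the size or number of the VMs themselves; the VM count and shape come from the cluster's master and worker node configuration.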