Tags: apache-spark, pyspark

Cluster Sizing for the Driver Node


When we submit a job, the driver runs the main method, converts the application into jobs, stages, and tasks, communicates with the cluster manager for resources, and schedules tasks on the worker nodes.

So, in Spark, does the driver ever process data? If not, can we always keep the smallest possible node for the driver and provision the required (sufficient) compute only for the workers, based on the workload?
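For context, driver and executor resources are requested independently, so the driver can in principle be sized much smaller than the executors. A minimal sketch of this, with purely illustrative sizes (in cluster deployments these values are usually passed via spark-submit flags or the cluster manager rather than set in code):

```python
from pyspark.sql import SparkSession

# Illustrative sizing only: a small driver, larger executors.
# Note: spark.driver.memory is only honored if it is set before the driver JVM
# starts, e.g. when launching a plain Python script or via spark-submit flags.
spark = (
    SparkSession.builder
    .appName("driver-sizing-example")
    .config("spark.driver.memory", "2g")       # small driver
    .config("spark.driver.cores", "1")
    .config("spark.executor.memory", "8g")     # bulk of the compute on executors
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "10")
    .getOrCreate()
)
```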


Solution

  • There are multiple reasons why you might want to have a driver with more resources than the bare minimum. Some examples:

    • For each task executed on an executor, some data is sent back to the driver and accumulated there. If you have a very high number of tasks, this can become a memory problem for the driver. See this SO answer for more info.
    • If you need to call df.collect() on a large DataFrame (typically not a good idea), this materializes a very large object inside your driver. That can cause OOM errors on the driver, for which you might want to increase the driver memory (see the first sketch after this list).
    • If a part of your application executes some intense non-distributed calculations (that are not on RDDs, DataFrames, or Datasets), you might want a more performant CPU on your driver.
    • When you do a broadcast join, Spark collects the broadcasted data on the driver. Depending on your needs, you might need a bigger driver to handle this (see the second sketch after this list). See this SO question for more info.

    This is just to give some context as to why you might want a bigger driver. In general, though, it's true that a very large driver is unnecessary in many cases. Using a small driver is a perfectly reasonable approach as long as you don't hit driver OOMs and the driver is not a bottleneck in terms of computation time.
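To illustrate the df.collect() point above, here is a hedged sketch (the paths are made up) contrasting collecting a large DataFrame into the driver with alternatives that keep the data distributed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/large_table")   # hypothetical input path

# Risky on a small driver: materializes every row as Python objects in driver memory.
# rows = df.collect()

# Usually safer alternatives that keep the work on the executors:
df.write.mode("overwrite").parquet("/data/output")   # write results out instead
sample = df.limit(1000).collect()                    # collect only a bounded subset
count = df.count()                                   # aggregate on executors, return a scalar
```

And for the broadcast-join point, a sketch (table and column names are made up) showing an explicit broadcast hint; because the broadcasted side is collected on the driver before being shipped to the executors, it must fit in driver memory:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
facts = spark.read.parquet("/data/facts")        # large table, stays distributed
dims = spark.read.parquet("/data/dimensions")    # small table

# The broadcast hint makes Spark collect `dims` on the driver and send a copy
# to every executor, avoiding a shuffle of the large `facts` table.
joined = facts.join(broadcast(dims), on="dim_id", how="left")

# Spark also broadcasts automatically below this size threshold (default 10 MB):
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
```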
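Both sketches assume a working SparkSession and existing input data; the thresholds and sizes are illustrative, not recommendations.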