When we submit a job, the driver runs the main method, converts the application into jobs, stages, and tasks, communicates with the cluster manager to request resources, and schedules tasks on the worker nodes.

So, in Spark, does the driver ever process the data? If not, can we always keep the smallest possible node for the driver and provide the required (sufficient) compute only for the workers, based on the workload?
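To make the question concrete, here is a rough PySpark sketch of the behaviour I'm assuming (the dataset and names are just illustrative): the transformations and the aggregation run on the executors, and only a small result travels back to the driver.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("driver-vs-executors").getOrCreate()

# Illustrative large dataset: its partitions live on the executors, not on
# the driver, while the plan below is built and executed.
df = spark.range(0, 100_000_000).withColumn("bucket", F.col("id") % 100)

# An action that reduces the data to a small result only ships that small
# result back to the driver, so a modest driver should be enough here.
per_bucket = df.groupBy("bucket").count()
print(per_bucket.count())   # a single number arrives at the driver

# By contrast, pulling every row back to the driver is where an undersized
# driver can run out of memory:
# rows = df.collect()       # avoid on large data
```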
There are multiple reasons why you might want a driver with more resources than the bare minimum. Some examples:
- If you call df.collect on a large DataFrame (typically not a good idea), this collects a very large object on the driver. That can cause OOM errors on the driver, in which case you would want to increase the driver memory.

This is just to give some context for why you might want a bigger driver. In general, though, it's true that a very large driver is unnecessary in many cases. Using a small driver is a perfectly reasonable approach as long as you don't hit driver OOMs and the driver is not a bottleneck in terms of computation time.
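As a rough sketch of the two usual responses to a driver OOM caused by a large collect (the sizes and paths below are placeholders, not recommendations): increase the driver memory when you submit the application, or, better, avoid pulling the full result back to the driver at all.

```python
# (1) Driver memory is normally set at submit time, e.g.:
#       spark-submit --driver-memory 8g --executor-memory 16g app.py
#
# (2) Prefer actions that keep the result small or stream it:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avoid-big-collect").getOrCreate()
df = spark.range(0, 100_000_000)

first_rows = df.take(20)                         # only 20 rows reach the driver
df.write.mode("overwrite").parquet("/tmp/out")   # result stays distributed

# If you really need to iterate locally, toLocalIterator() fetches one
# partition at a time instead of materializing the whole DataFrame at once.
for row in df.limit(1_000).toLocalIterator():
    pass
```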