In Spark, there can be multiple worker nodes, and each worker node can have multiple executors.
When we broadcast a DataFrame in Spark, the entire DataFrame is copied to every executor. However, when we 'cache' the data, the partitions are persisted across the executors, i.e. each executor holds a few partitions of the data, not the entire dataset.
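A minimal sketch of the two operations (assuming a running SparkSession and two DataFrames, `small_df` and `large_df`; the names are illustrative):

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcast: every executor receives a full copy of small_df,
// so the join can proceed without shuffling large_df's partitions.
val joined = large_df.join(broadcast(small_df), "id")

// Cache: each executor persists only the partitions it already holds.
val cached = large_df.cache()
cached.count()  // an action is needed to actually materialize the cache
```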
Shuffle is the movement of data across worker nodes, or across executors within the same worker node. Given this, is it fair to conclude that caching would result in a shuffle?
No. Caching does not move data between executors, so it never triggers a shuffle. Each executor persists the partitions it already holds; if there is not enough memory, caching can only result in a spill to the local disk of the worker where the executor resides.
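To see where the spill comes from, compare storage levels (a sketch, assuming a DataFrame `df`; the variable name is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Dataset.cache() defaults to MEMORY_AND_DISK: partitions that do not
// fit in executor memory spill to the executor's local disk on its own
// worker -- no data moves between executors or workers.
df.persist(StorageLevel.MEMORY_AND_DISK)

// With MEMORY_ONLY there is no spill at all: partitions that do not fit
// are dropped and recomputed from lineage when needed.
// df.persist(StorageLevel.MEMORY_ONLY)

df.count()  // an action materializes the persisted partitions
```

Either way, persistence is purely local to each executor, which is why it cannot cause a shuffle.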