Tags: apache-spark, pyspark

cache resulting in shuffle


In Spark, there can be multiple worker nodes, and each worker node can host multiple executors.

When we broadcast a DataFrame in Spark, the entire DataFrame is copied to each executor. However, when we cache() the data, the partitions are persisted across executors, i.e. each executor holds a few partitions of the data, not the entire dataset.

A shuffle is the movement of data across worker nodes, or across executors within the same worker node. Given this, is it fair to conclude that caching would result in a shuffle?


Solution

  • No. Caching does not move data between executors; each executor simply persists the partitions it already holds. The only disk activity cache can cause is a spill to the local disk of the worker where the executor resides, when there is not enough memory.