When we broadcast a DataFrame (say, for a broadcast join), Spark sends the same copy of the DataFrame to all the executors, i.e. each executor (not node, but executor) holds one copy of the data. So if a node runs 3 executors, it unnecessarily holds 3 copies of the same data.
My question is: why can't it use off-heap space? If I provide sufficient off-heap memory, why can't Spark store the data there, with just one copy per node that all executors on that node read from?
That's a good question. Let me first clarify a few things:
- Executors are independent JVM processes that run tasks in isolation so that they do not interfere with each other's tasks. Sharing data between them would require inter-process communication (IPC), which adds complexity and can become a bottleneck.
- Because each executor gets its own copy of the broadcast data, a copy that gets corrupted in one executor does not affect the others. With a single shared copy, the data would become inconsistent for every executor on the node.
- Spark is built around fault tolerance: if a task fails, Spark re-executes it on another executor with available resources. Keeping a copy of the broadcast data on each executor simplifies this re-execution.
Why can't it use off-heap space?
- Performance: Accessing shared off-heap memory could be slower than reading a copy local to the executor, because of IPC or similar overhead.
- Lack of Built-In Mechanisms: Spark's core design treats the executor as the primary unit of memory management. Sharing data directly between executors on the same node would require significant architectural changes to Spark's memory management system.
- Compatibility issues with different Cluster Managers: Spark runs on several cluster managers (such as YARN, Kubernetes, or standalone), and introducing node-level shared memory would require modifications and support from each of these environments.
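To make the per-executor point concrete, here is a small sketch in plain Python (no Spark needed to run it). The property names `spark.memory.offHeap.enabled` and `spark.memory.offHeap.size` are real Spark settings, and the size applies to each executor rather than to the node as a whole. The `2g` value, the 3-executor node, and the helper names `to_submit_args` and `node_total_gb` are assumptions made up for illustration.

```python
# Real Spark property names; the values are assumptions for illustration.
offheap_conf = {
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "2g",  # granted to EACH executor, not to the node
}


def to_submit_args(conf):
    """Render a conf dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def node_total_gb(per_executor_gb, executors_per_node):
    """Memory claimed on one node when every executor gets its own allocation."""
    return per_executor_gb * executors_per_node


print(" ".join(to_submit_args(offheap_conf)))
print(node_total_gb(2, 3))  # hypothetical node running 3 executors
```

So enabling off-heap memory changes where each executor allocates, but it does not create a pool that the executors on a node share.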
Alternatives:
- Distributed file systems, like HDFS.
- Distributed caching, like Redis/Memcached.
- Custom shared memory, such as files on each node that every executor reads.
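As a rough sketch of the last alternative, the snippet below writes a lookup table to a file once and then memory-maps it read-only. It is plain Python, not a Spark API: `publish_table`, `open_table`, `lookup`, and the fixed-width record layout are all hypothetical names and choices for illustration. In a real job, something would have to place the file on every node, and each task would open the mapping itself. The idea is that read-only mappings of the same file are backed by the operating system's page cache, so several processes on one node can read it without each holding a private copy.

```python
import mmap
import os
import struct
import tempfile

RECORD = struct.Struct("<q")  # one little-endian 64-bit integer per entry


def publish_table(path, values):
    """Write the lookup table once per node as fixed-width binary records."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def open_table(path):
    """Map the file read-only; the OS page cache backs every mapping of it."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def lookup(table, index):
    """Read entry `index` straight from the mapping, without loading the file."""
    return RECORD.unpack_from(table, index * RECORD.size)[0]


path = os.path.join(tempfile.mkdtemp(), "lookup.bin")
publish_table(path, [10, 20, 30, 40])

# Two independent mappings stand in for two executor processes on one node.
reader_a = open_table(path)
reader_b = open_table(path)
print(lookup(reader_a, 2), lookup(reader_b, 3))
```

This only works for data that stays read-only after it is published, and it gives up the bookkeeping Spark does for broadcast variables, so cleanup and distribution of the file become your responsibility.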
Resources used:
1 - https://blog.devgenius.io/spark-on-heap-and-off-heap-memory-27b625af778b
2 - https://spark.apache.org/docs/latest/tuning.html