apache-spark pyspark

In a Spark cluster, is a copy of a broadcast variable kept on every executor process or only on every machine?


I am reading Spark: The Definitive Guide.

One question that occurred to me while reading: is a copy of the broadcast variable made for every executor process on a machine, or only once per machine?

Since a broadcast variable is supposed to be immutable, it would make sense for there to be only one copy per machine, but the text I'm reading is not clear about this.


Solution

  • From the docs (note the phrase "cached on each machine"):

    Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
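
    For context, here is a minimal PySpark sketch of the usage the quoted sentence describes: a read-only lookup table is broadcast once and tasks read it through `.value` instead of carrying a copy in each task. The table contents, the `resolve` helper, and the `local[2]` master are illustrative assumptions, not taken from the book or the docs, and the sketch only shows the API; it does not by itself demonstrate where the cached copies live.

```python
# Hypothetical lookup table; the names and data are illustrative only.
COUNTRY_NAMES = {"DE": "Germany", "FR": "France", "JP": "Japan"}


def resolve(code, table):
    """Return the full name for a country code, or "unknown" if absent."""
    return table.get(code, "unknown")


def run_demo():
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    # Assumption: a local two-thread master, just to make the sketch runnable.
    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("broadcast-demo")
        .getOrCreate()
    )
    sc = spark.sparkContext

    # Ship the read-only table to the workers once, not with every task.
    names_bc = sc.broadcast(COUNTRY_NAMES)

    codes = sc.parallelize(["DE", "JP", "XX", "FR"])
    # Tasks read the cached copy through .value.
    result = codes.map(lambda c: resolve(c, names_bc.value)).collect()

    names_bc.unpersist()  # release the cached copies on the executors
    spark.stop()
    return result


if __name__ == "__main__":
    print(run_demo())
```

    Referencing `names_bc.value` inside the lambda, rather than `COUNTRY_NAMES` directly, is what makes the tasks use the broadcast copy.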