Tags: apache-spark, serialization, broadcast

What is the difference between passing local variable VS broadcast variable to spark pipeline?


Consider the code below:

val rdd: RDD[String] = domainsRDD()
val blacklistDomains: Set[String] = readDomainsBlacklist()
rdd.filter(domain => !blacklistDomains.contains(domain))

vs. the code where the blacklisted domains are broadcast:

val rdd: RDD[String] = domainsRDD()
val bBlacklistDomains: Broadcast[Set[String]] = sc.broadcast(readDomainsBlacklist())
rdd.filter(domain => !bBlacklistDomains.value.contains(domain))

Aside from the fact that a broadcast variable can be explicitly removed from the executors (via bBlacklistDomains.destroy()), are there any other reasons to use it (performance?)? (Please note that in the first code example blacklistDomains is a local variable, so no serialization issue will appear.)


Solution

  • There is none; local variables used within a stage are automatically broadcast.

    Spark automatically broadcasts the common data needed by tasks within each stage.
    The data broadcasted this way is cached in serialized form and deserialized before running each task.
    This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
    

    From the docs: https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
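To illustrate the case the docs describe, here is a minimal sketch of when an explicit broadcast pays off: the same lookup set is reused by tasks in more than one job (and hence more than one stage). The sample data, the local[*] master, and the helper names are assumptions for the example, not part of the question's code.

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastAcrossStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

    // Hypothetical blacklist; in the question this would come from readDomainsBlacklist().
    val blacklist: Set[String] = Set("spam.example", "ads.example")
    val bBlacklist: Broadcast[Set[String]] = sc.broadcast(blacklist)

    val domains: RDD[String] =
      sc.parallelize(Seq("spam.example", "good.example", "ads.example"))

    // First job: filter out blacklisted domains.
    val clean: RDD[String] = domains.filter(d => !bBlacklist.value.contains(d))
    println(s"clean count: ${clean.count()}")

    // Second job: tasks in these new stages reuse the same broadcast value,
    // which is the situation where an explicit broadcast is actually useful.
    val flagged: RDD[(String, Boolean)] =
      domains.map(d => (d, bBlacklist.value.contains(d)))
    flagged.collect().foreach(println)

    // Release the broadcast data on the executors once it is no longer needed.
    bBlacklist.destroy()
    sc.stop()
  }
}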