consider the code below:
val rdd: RDD[String] = domainsRDD()
val blacklistDomains: Set[String] = readDomainsBlacklist()
rdd.filter(domain => !blacklistDomains.contains(domain))
versus the code where the blacklisted domains are broadcast:
val rdd: RDD[String] = domainsRDD()
val bBlacklistDomains: Broadcast[Set[String]] = sc.broadcast(readDomainsBlacklist())
rdd.filter(domain => !bBlacklistDomains.value.contains(domain))
Apart from the fact that a broadcast variable can be removed from the executors (via bBlacklistDomains.destroy()), are there any other reasons to use it (performance?)?
(Please note that in the first code example blacklistDomains is a local variable, so no serialization issue arises.)
There is none: local variables used within a stage are automatically broadcast.
Spark automatically broadcasts the common data needed by tasks within each stage.
The data broadcasted this way is cached in serialized form and deserialized before running each task.
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
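To make the "same data across multiple stages" case concrete, here is a minimal sketch. It assumes a local Spark session; the domain list and blacklist are hypothetical stand-ins for the question's domainsRDD() and readDomainsBlacklist(). The point is that the explicitly broadcast set is shipped to each executor at most once, even though two separate jobs reference it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.broadcast.Broadcast

object BroadcastAcrossStages {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-ins for domainsRDD() / readDomainsBlacklist()
    val domains = sc.parallelize(Seq("good.com", "bad.com", "ok.org", "spam.net"))
    val blacklist: Set[String] = Set("bad.com", "spam.net")

    // Broadcast once; executors cache the serialized set
    val bBlacklist: Broadcast[Set[String]] = sc.broadcast(blacklist)

    // First job using the broadcast
    val clean = domains.filter(d => !bBlacklist.value.contains(d))
    println(clean.count())

    // Second, separate job reusing the same broadcast; without the explicit
    // broadcast, the set captured in the closure would be re-shipped with
    // the tasks of this job as well
    val flagged = domains.map(d => (d, bBlacklist.value.contains(d)))
    flagged.collect().foreach(println)

    bBlacklist.destroy() // free the cached copies on the executors
    spark.stop()
  }
}
```

With only a single filter over a small set, as in the question, the explicit broadcast buys essentially nothing; it starts to pay off when the set is large or referenced from many jobs.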
From the docs: https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables