Say I have an object and I need to make some operations towards the member of this object: arr
.
object A {
val arr = (0 to 1000000).toList
def main(args: Array[String]): Unit = {
//...init spark context
val rdd: RDD[Int] = ...
rdd.map(arr.contains(_)).saveAsTextFile...
}
}
What is the difference between broadcasted arr
and not broadcasted?
i.e.
val arrBr = sc.broadcast(arr)
rdd.map(arrBr.value.contains(_))
and
rdd.map(arr.contains(_))
In my opinion, the object A
is a singleton object, so it will be transferred through the nodes in Spark.
Is it necessary to use broadcast in this scenario?
In the case
rdd.map(arr.contains(_))
arr
is serialized shipped for each task
while in
val arrBr = sc.broadcast(arr)
rdd.map(arrBr.value.contains(_))
this is only done once per executor.
Therefore you should use broadcast when dealing with large datastructures.