apache-spark, broadcast

Is it necessary to broadcast an object member in Spark?


Say I have an object and need to perform some operations on one of its members, arr:

object A {
  val arr = (0 to 1000000).toList
  def main(args: Array[String]): Unit = {
    //...init spark context
    val rdd: RDD[Int] = ...
    rdd.map(arr.contains(_)).saveAsTextFile...
  }
}

What is the difference between broadcasting arr and not broadcasting it, i.e. between

val arrBr = sc.broadcast(arr)
rdd.map(arrBr.value.contains(_))

and

rdd.map(arr.contains(_))

As I understand it, A is a singleton object, so it will be transferred to the nodes in Spark anyway.

Is it necessary to use broadcast in this scenario?


Solution

  • In the case

    rdd.map(arr.contains(_))
    

    arr is serialized and shipped with each task,

    while in

    val arrBr = sc.broadcast(arr)
    rdd.map(arrBr.value.contains(_))
    

    serialization and shipping happen only once per executor.

    Therefore you should use broadcast when dealing with large data structures.
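
The broadcast variant can be tried end to end with a minimal sketch like the one below. It is only an illustration under stated assumptions: the object name `BroadcastSketch`, the helper `membership`, the app name, and the `local[2]` master are hypothetical and not part of the question, and the sample input is made up for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Hypothetical names throughout; a sketch, not the asker's actual job.
object BroadcastSketch {

  // Checks each element of xs against a large list that is broadcast once.
  def membership(sc: SparkContext, xs: Seq[Int]): Array[Boolean] = {
    // Same large structure as in the question, held in a local value
    val arr: List[Int] = (0 to 1000000).toList

    // Wrap it in a broadcast variable so tasks read it via .value
    val arrBr: Broadcast[List[Int]] = sc.broadcast(arr)

    // The closure references only the broadcast handle, not arr itself
    val result = sc.parallelize(xs).map(x => arrBr.value.contains(x)).collect()

    // Release the cached copies on the executors once they are no longer needed
    arrBr.unpersist()
    result
  }

  def main(args: Array[String]): Unit = {
    // Assumed local settings for a trial run
    val conf = new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val flags = membership(sc, Seq(-5, 0, 42, 1000000, 1000001))
      println(flags.mkString(", ")) // prints: false, true, true, true, false
    } finally {
      sc.stop()
    }
  }
}
```

Because the closure passed to map holds only the broadcast handle, the large list is not part of what gets serialized with the tasks.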