I am using Spark Java API to implement A-Priori algorithm described in MMD, chapter 6, and the algorithm will need to involve a huge int
array like this:
frequent_item[i] = x, // i is a big integer, x is some integer
Now, how to make this array visible to all the worker nodes in the cluster? more specifically,
sc.broadcast(frequent_item)
be used for this purpose?Thanks, as always!
Broadcast is the right approach.
val y = sc.broadcast(frequent_item)
will broadcast frequent_item
and y will become Broadcast[Array[Int]]
and the value can be
accessed by using: y.value
To access (i)th element the code is
val element = y.value(i)
// scala notation
Does this mean this huge array will have a copy in the memory of each worker node? Yes there will be copy of the data in each node.
Best practise a.)estimate the size of the broadcast variable and determine the executor and driver memories with this in consideration. b.) broadcast only when needed c.) unpersist once the broadcast variable is not used.
For more information read Spark Brodcast