Search code examples
apache-sparkapriori

how to make a huge array visible to all worker nodes in Spark


I am using Spark Java API to implement A-Priori algorithm described in MMD, chapter 6, and the algorithm will need to involve a huge int array like this:

frequent_item[i] = x, // i is a big integer, x is some integer

Now, how to make this array visible to all the worker nodes in the cluster? more specifically,

  1. can sc.broadcast(frequent_item) be used for this purpose?
  2. does this mean this huge array will have a copy in the memory of each worker node?
  3. what would be the best practice guideline for things like this?

Thanks, as always!


Solution

  • Broadcast is the right approach.

    1. val y = sc.broadcast(frequent_item) will broadcast frequent_item and y will become Broadcast[Array[Int]] and the value can be accessed by using: y.value

      To access (i)th element the code is val element = y.value(i) // scala notation

    2. Does this mean this huge array will have a copy in the memory of each worker node? Yes there will be copy of the data in each node.

    3. Best practise a.)estimate the size of the broadcast variable and determine the executor and driver memories with this in consideration. b.) broadcast only when needed c.) unpersist once the broadcast variable is not used.

    For more information read Spark Brodcast