Tags: scala, apache-spark, distributed-computing

Distributed process updating a global/single variable in Spark


I'm having trouble processing a vast amount of data on a cluster.

The code:

val (sumZ, batchSize) = data.rdd.repartition(4)
  .treeAggregate((0L, 0L))(
    seqOp = (c, v) => {
      // c: (z, count), v
      val step = this.update(c, v)
      (step._1, c._2 + 1)
    },
    combOp = (c1, c2) => {
      // c: (z, count)
      (c1._1 + c2._1, c1._2 + c2._2)
    })

val finalZ = sumZ / 4

As you can see in the code, my current approach is to process this data partitioned into 4 chunks (x0, x1, x2, x3), making the processing of each chunk independent. Each partition produces an output (z0, z1, z2, z3), and the final value of z is the average of these 4 results.

This approach works, but both the precision and the computing time are affected by the number of partitions.

My question is whether there is a way to generate a "global" z that is updated from every process (partition).


Solution

  • TL;DR There is not. Spark doesn't have shared memory with synchronized access, so no truly global variable can exist.

    The only form of "shared" writable variable in Spark is the Accumulator. It allows write-only access via a commutative and associative merge function (see the sketch below).

    Since its implementation is equivalent to reduce / aggregate:

    • Each partition has its own copy, which is updated locally.
    • After a task completes, partial results are sent to the driver and combined with the "global" instance.

    it won't resolve your problem.
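
    To make the write-only, merge-at-the-driver semantics concrete, here is a minimal, self-contained sketch (the object name AccumulatorSketch and the accumulator names are my own, illustrative choices) that sums and counts values with the built-in LongAccumulator. Each task only adds to its local copy; the combined value is visible on the driver once the action finishes, which is why an accumulator cannot serve as a global z that every partition reads during the computation.

    import org.apache.spark.sql.SparkSession

    object AccumulatorSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("acc-sketch")
          .master("local[4]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Accumulators: executors can only add to them, never read a merged value
        val sumAcc = sc.longAccumulator("sumZ")
        val countAcc = sc.longAccumulator("batchSize")

        sc.parallelize(1L to 1000L, numSlices = 4).foreach { v =>
          sumAcc.add(v)     // updates this partition's local copy only
          countAcc.add(1L)
          // Other partitions' contributions are NOT visible here.
        }

        // Only on the driver, after the action, are the partial results combined.
        println(s"sum = ${sumAcc.value}, count = ${countAcc.value}")
        spark.stop()
      }
    }

    This mirrors what treeAggregate already does for you: per-partition partial results merged by an associative, commutative function, with the final value only available after the job completes.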