I'm having trouble processing a vast amount of data on a cluster.
The code:
val (sumZ, batchSize) = data.rdd.repartition(4)
  .treeAggregate((0L, 0L))(
    seqOp = (c, v) => {
      // c: (z, count) accumulated so far within the partition, v: next element
      val step = this.update(c, v)
      (step._1, c._2 + 1)
    },
    combOp = (c1, c2) => {
      // c1, c2: (z, count) partial results from two partitions
      (c1._1 + c2._1, c1._2 + c2._2)
    })
val finalZ = sumZ / 4
As you can see in the code, my current approach is to process the data partitioned into 4 chunks (x0, x1, x2, x3), with each partition processed independently. Each one generates an output (z0, z1, z2, z3), and the final value of z is the average of these 4 results.
This approach works, but the precision (and the computing time) is affected by the number of partitions.
My question is whether there is a way of generating a "global" z that gets updated by every process (partition).
TL;DR There is not. Spark doesn't have shared memory with synchronized access, so no truly global, mutable state can exist.
The only form of "shared" writable variable in Spark is an Accumulator. It allows write-only access with a commutative and associative merge function. Since its implementation is equivalent to a reduce / aggregate, it won't resolve your problem.
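
To make the restriction concrete, here is a minimal sketch (the SparkSession setup, accumulator names, and toy dataset are illustrative, not taken from your code) of what an accumulator gives you: tasks can only add to it, and the merged value is readable only on the driver after the action finishes, which is exactly the semantics of a reduce / aggregate.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("accumulator-sketch")
  .master("local[4]")
  .getOrCreate()
val sc = spark.sparkContext

// Write-only from the tasks' point of view: executors may call add(),
// but they cannot read partial values written by other tasks.
val zAcc = sc.doubleAccumulator("sumZ")
val nAcc = sc.longAccumulator("count")

sc.parallelize(1 to 100, 4).foreach { v =>
  // Commutative and associative updates only; no "global" z is visible here.
  zAcc.add(v.toDouble)
  nAcc.add(1L)
}

// Only the driver sees the merged totals, and only after the action completes,
// so this behaves like a reduce/aggregate over the partitions.
println(s"z = ${zAcc.value}, count = ${nAcc.value}")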