I have read that reduce function must be commutative and associative. How should I write a function to find the average so it conforms with this requirement? If I apply the following function to count an average for an RDD it will not count the average correctly. Could anyone explain what is wrong with my function?
I guess that it takes two elements say 1, 2 and applies the function to them like (1+2)/2
. Then sums up the result with the next element, 3 and divides it by 2 etc.
val rdd = sc.parallelize(1 to 100)
rdd.reduce((_ + _) / 2)
you can also use PairRDD
to keep track of sum of all elements together with counts of elements.
val pair = sc.parallelize(1 to 100)
.map(x => (x, 1))
.reduce((x, y) => (x._1 + y._1, x._2 + y._2))
val mean = pair._1 / pair._2