Tags: scala, apache-spark, mapreduce, rdd

How to find an average for a Spark RDD?


I have read that the function passed to reduce must be commutative and associative. How should I write a function to find the average so that it conforms to this requirement? If I apply the following function to compute the average of an RDD, it does not produce the correct result. Could anyone explain what is wrong with my function?

I guess that it takes two elements, say 1 and 2, and applies the function to them, giving (1 + 2) / 2. It then combines that result with the next element, 3, divides by 2 again, and so on.

val rdd = sc.parallelize(1 to 100)

rdd.reduce((x, y) => (x + y) / 2)
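
The problem is exactly the one hinted at above: the pairwise average is commutative but not associative, so the result depends on how Spark happens to group the elements across partitions. A quick check in plain Scala (no Spark needed) shows that the grouping changes the answer:

val avg = (x: Double, y: Double) => (x + y) / 2

avg(avg(1, 2), 3)   // ((1 + 2) / 2 + 3) / 2 = 2.25
avg(1, avg(2, 3))   // (1 + (2 + 3) / 2) / 2 = 1.75
// Neither equals the true mean of 1, 2, 3, which is 2.0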

Solution

  • You can also use a pair RDD to keep track of the sum of all elements together with the count of elements (an alternative using aggregate is sketched after the code below).

    val pair = sc.parallelize(1 to 100)
      .map(x => (x, 1))
      .reduce((x, y) => (x._1 + y._1, x._2 + y._2))
    
    // Use toDouble to avoid integer division (5050 / 100 = 50, not 50.5)
    val mean = pair._1.toDouble / pair._2
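
As a side note, the same (sum, count) bookkeeping can be done with aggregate, which avoids building an intermediate pair for every element, and Spark's numeric RDDs also provide a built-in mean(). A minimal sketch:

    val rdd = sc.parallelize(1 to 100)
    
    // aggregate carries (sum, count) through the fold directly.
    val (sum, count) = rdd.aggregate((0L, 0L))(
      (acc, x) => (acc._1 + x, acc._2 + 1),  // fold one element into the pair
      (a, b) => (a._1 + b._1, a._2 + b._2))  // merge partial results across partitions
    val mean = sum.toDouble / count          // 50.5
    
    // Or use the built-in mean() on numeric RDDs:
    val mean2 = rdd.mean()                   // 50.5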