Search code examples
scalaapache-sparkrdd

Spark/Scala update the value of a variable in another map?


In Spark, I have a

closest: org.apache.spark.rdd.RDD[(Int, (breeze.linalg.Vector[Double], Int))] = MapPartitionsRDD[476] at map at command-1043253026161724:1

I want to calculate some total distance:

var tempDist=0.0
closest.foreach(x=> tempDist=tempDist+squaredDistance(x._2._1, kPoints(x._1)))

But this doesn't change tempDist's value at all. I suspect Spark doesn't do anything. So how can I calculate the distance?


Solution

  • Don't do mutable vars. It's a bad idea in general, and doesn't work at all with spark, at least, not the way you are doing it, because it's a distributed system. Different partitions of the sequence are located on different computers, and are being processed independently in parallel and in different JVMs, each of which has its own copy of the var.

      val tempDist = closest
        .map { x => squaredDistance(x._2._1, kPoints(x._1) }
        .fold(0) { _ + _ }