Tags: scala, apache-spark, rdd

modifying RDD of object in spark (scala)


I have:

val rdd1: RDD[myClass]

It has been initialized; I checked while debugging that all the members have their default values.

If I do

rdd1.foreach(x => x.modifier())

where modifier is a member function of myClass that modifies some of the member variables.

After executing this, if I check the values inside the RDD, they have not been modified.

Can someone explain what's going on here? And is it possible to make sure the values are modified inside the RDD?

EDIT:

import scala.collection.mutable.Buffer

class myClass(var id: String, var sessions: Buffer[Long], var avgsession: Long) {
    def calcAvg(): Unit = {
        // calculate the average by summing over sessions and dividing by length,
        // then store it in avgsession
        avgsession = if (sessions.isEmpty) 0L else sessions.sum / sessions.length
    }
}

The avgsession attribute is not updated when I do

myrdd.foreach(x => x.calcAvg())

Solution

  • RDDs are immutable: calling a mutating method on the objects they contain will not have any effect, because foreach runs on the executors against deserialized copies of your objects, and those copies are discarded afterwards.

    The way to obtain the result you want is to produce new copies of MyClass instead of modifying the instance:

    case class MyClass(id: String, avgsession: Long) {
        def modifier(a: Int): MyClass =
            this.copy(avgsession = this.avgsession + a)
    }
    

    Now you still cannot update rdd1, but you can derive rdd2, which will contain the updated instances:

    val rdd2 = rdd1.map(_.modifier(18))
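
To connect this back to the class in the question: the same copy-based pattern works for computing avgsession. Here is a minimal sketch in plain Scala on a local Seq, since map behaves the same way on an RDD; the names MyClass, withAvg, and updated are illustrative, not from the original post:

```scala
// Hypothetical rework of the question's class as an immutable case class
case class MyClass(id: String, sessions: Seq[Long], avgsession: Long) {
  // Return a new instance with the average filled in, instead of mutating in place
  def withAvg: MyClass =
    copy(avgsession = if (sessions.isEmpty) 0L else sessions.sum / sessions.length)
}

// On an RDD this would be myrdd.map(_.withAvg); the same call on a local Seq:
val updated = Seq(MyClass("a", Seq(2L, 4L), 0L)).map(_.withAvg)
// updated.head.avgsession is now 3
```

Because each element is replaced rather than mutated, the result survives serialization between driver and executors, and recomputing the RDD from its lineage yields the same values.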