Tags: scala, apache-spark, rdd

Is it bad to put an RDD inside a Serializable Class?


According to this article, when you use an object inside an RDD.map, for example, Spark first serializes the whole object. Now suppose I have an RDD defined as a member of that serializable class. What would Spark do with that RDD? Would it try to serialize it as well, and if so, how?

Here is some example code:

import org.apache.spark.rdd.RDD

class SomeClass extends Serializable {
  var a: String = _
  var b: Int = _
  var rdd: RDD[_] = _

  // ...
}

val objectOfSomeClass = new SomeClass(...)
...
someRDD.map(x => someFunc(objectOfSomeClass))
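
For context, a minimal sketch of the common way to keep an RDD member out of a serialized closure (a hypothetical variant of SomeClass; the @transient pattern and the local-val copy are standard, but the concrete names and method here are invented for illustration):

import org.apache.spark.rdd.RDD

class SomeClass(val a: String, val b: Int,
                @transient val rdd: RDD[Int]) // @transient: skipped during serialization
    extends Serializable {

  def scaled(data: RDD[Int]): RDD[Int] = {
    val factor = b              // copy the field into a local val on the driver
    data.map(x => x * factor)   // the closure now captures only factor, not this
  }
}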

Solution

  • Re: your comment

    "I am just worried if serialization of the whole class also involves serialization of the RDD inside it."

    The code that you have shown does not need the whole object to be serialized, which is why you have not faced any serialization issues so far. Instead of passing a and b separately, if you pass objectOfSomeClass itself, then I believe you would face a serialization issue.
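
    A small sketch of that difference, reusing the names from the question (the signature of someFunc is assumed here; it is not given in the question):

    // Safe: copy just the needed fields into local vals on the driver.
    // The closure then captures two small serializable values.
    val localA = objectOfSomeClass.a
    val localB = objectOfSomeClass.b
    val fine = someRDD.map(x => someFunc(localA, localB))

    // Risky: referencing the object itself makes Spark serialize the whole
    // SomeClass instance, including its rdd member.
    val risky = someRDD.map(x => someFunc(objectOfSomeClass))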

    In one of your comments you have also mentioned:

    I am just worried if it affects the performance.

    That does not come into the picture unless you perform an action on that RDD. RDDs are lazily evaluated: the transformations run only when an action is called, and that is when Spark reads the data and executes them. In your example I do not see any action, so it should not affect the performance of your application.
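
    For example, a minimal sketch of that laziness (assuming an existing SparkContext named sc):

    val nums = sc.parallelize(1 to 1000000)  // no job runs yet
    val doubled = nums.map(_ * 2)            // map is a transformation: still lazy
    val evens = doubled.filter(_ % 2 == 0)   // filter is also lazy

    // Only an action such as count() actually triggers the computation.
    val total = evens.count()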

    Hope this clarifies a couple of your doubts.

    -Amit