Tags: scala, apache-spark, rdd, case-class

Spark Scala convert RDD with Case Class to simple RDD


This is fine:

case class trans(atm : String, num: Int)
    
val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
val rdd = sc.parallelize(array)
val rdd1 = rdd.map(x => (x._1, trans(x._2, x._3)))

How can I convert this back to a simple RDD like the original rdd?

E.g. rdd: org.apache.spark.rdd.RDD[(Int, String, Int)]

I can do this, for sure:

val rdd2 = rdd1.mapValues(v => (v.atm, v.num)).map(x => (x._1, x._2._1, x._2._2))

But what if the case class has many fields? Ideally I'd like to do this dynamically, without listing every field by hand.
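
For instance, with a hypothetical wider record (made up here purely for illustration), the hand-written flattening grows with every field:

// Hypothetical wider record, just to illustrate the problem:
case class bigTrans(atm: String, num: Int, code: String, region: String, amount: Double)

val bigRdd = sc.parallelize(Array((1, bigTrans("ATM", 1, "X", "EU", 9.5))))

// Every field has to be listed by hand:
val bigFlat = bigRdd.map { case (k, v) => (k, v.atm, v.num, v.code, v.region, v.amount) }
// bigFlat: org.apache.spark.rdd.RDD[(Int, String, Int, String, String, Double)]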


Solution

  • Not sure exactly how generic you want to go, but for your example of an RDD[(Int, trans)] you can use the unapply method of the trans companion object to flatten the case class into a tuple (a more generic sketch follows at the end of this answer).

    So, if you have your setup:

    case class trans(atm : String, num: Int)
    
    val array = Array((20254552,"ATM",-5100), (20174649,"ATM",5120))
    val rdd = sc.parallelize(array)
    val rdd1 = rdd.map(x => (x._1, trans(x._2, x._3)))
    

    You can do the following:

    import shapeless.syntax.std.tuple._   // brings +: (prepend) for plain tuples into scope

    val output = rdd1.map {
      case (myInt, myTrans) =>
        // trans.unapply returns Option[(String, Int)]; +: prepends the Int key
        myInt +: trans.unapply(myTrans).get
    }
    output
    res15: org.apache.spark.rdd.RDD[(Int, String, Int)]
    

    We import shapeless.syntax.std.tuple._ so that +: is available on plain tuples, letting us prepend the Int to the flattened tuple (the myInt +: trans.unapply(myTrans).get operation).
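
    If you want this to work for arbitrary case classes rather than trans specifically, one possible direction (a sketch, not part of the original answer, assuming shapeless is already on the classpath as above) is to go through shapeless's Generic, which turns any case class into an HList that a Tupler can then convert into a tuple:

    import shapeless._
    import shapeless.ops.hlist.Tupler

    // Hypothetical generic helper: case class -> HList -> tuple.
    // Limited to case classes with at most 22 fields (the tuple arity limit).
    def toTuple[A, L <: HList, T](a: A)(
        implicit gen: Generic.Aux[A, L], // derives the HList representation of A
        tupler: Tupler.Aux[L, T]         // converts that HList into a tuple
    ): T = tupler(gen.to(a))

    // Usage with the example above (yields a nested rather than flat tuple):
    // rdd1.map { case (k, t) => (k, toTuple(t)) }
    // gives an RDD[(Int, (String, Int))]

    This is only a sketch: the key ends up next to a nested tuple rather than flattened into it (prepending it generically needs more shapeless machinery, e.g. the same +: syntax used above), and whether the derived instances serialize cleanly inside a Spark closure is something you'd need to verify.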