scala, apache-spark

Spark Scala RDD[Row] to DataFrame - using toDF not possible


In Spark using Scala, when we have to convert an RDD[Row] to a DataFrame, why do we have to convert the RDD[Row] to an RDD of a case class or an RDD of tuples in order to use rdd.toDF()? Is there a specific reason this was not provided for RDD[Row]?

import org.apache.spark.sql.{Row, SparkSession}

object RDDParallelize {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("learn")
      .getOrCreate()

    val abc  = Row("val1", "val2")
    val abc2 = Row("val1", "val2")
    val rdd1 = spark.sparkContext.parallelize(Seq(abc, abc2))

    import spark.implicits._
    rdd1.toDF() // does not compile: no implicit Encoder[Row] in scope
  }
}
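
For contrast, the same data converts fine once the rows are plain tuples, because spark.implicits._ supplies an Encoder for tuples of primitives. A minimal sketch (the column names passed to toDF are chosen here just for illustration):

import org.apache.spark.sql.SparkSession

object TupleToDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("learn").getOrCreate()
    import spark.implicits._

    // Tuples of primitives have an implicit Encoder, so toDF compiles:
    val rdd = spark.sparkContext.parallelize(Seq(("val1", "val2"), ("val1", "val2")))
    rdd.toDF("c1", "c2").show()
  }
}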

Solution

  • It is confusing because there is an implicit conversion behind the toDF method. As you may have seen, toDF is not a method of the RDD class; it is defined on DatasetHolder, and you are relying on rddToDatasetHolder in SQLImplicits to convert the RDD you created into a DatasetHolder. If you look into the method rddToDatasetHolder,

    implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
      DatasetHolder(_sqlContext.createDataset(rdd))
    }

    you will see that it requires an Encoder for T, which is

    Used to convert a JVM object of type T to and from the internal Spark SQL representation.
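
    You can check encoder resolution directly in the REPL: a tuple encoder resolves implicitly, while the Row version fails to compile. A small sketch (not from the original answer):

    import org.apache.spark.sql.{Encoder, Row, SparkSession}

    val spark = SparkSession.builder().master("local[1]").appName("learn").getOrCreate()
    import spark.implicits._

    // Resolves: spark.implicits._ provides encoders for Product types.
    val tupleEnc: Encoder[(String, String)] = implicitly[Encoder[(String, String)]]

    // Does not compile: no implicit Encoder[Row] is in scope.
    // val rowEnc: Encoder[Row] = implicitly[Encoder[Row]]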

    If you try to convert an RDD[Row] to a DatasetHolder, you need an Encoder to tell Spark how to convert Row objects into the internal SQL representation. However,

    Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases

    Spark does not ship an implicit Encoder for the Row type, so the implicit conversion can never be applied and the call fails to compile.
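
    As a practical workaround (not part of the original answer), Spark exposes spark.createDataFrame for exactly this case: you pass the RDD[Row] together with an explicit schema, so no Encoder[Row] is needed. A minimal sketch, with made-up column names:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    object RowRDDToDF {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[1]").appName("learn").getOrCreate()

        val rdd = spark.sparkContext.parallelize(Seq(Row("val1", "val2"), Row("val1", "val2")))

        // Spark cannot infer the layout of a generic Row, so describe it explicitly:
        val schema = StructType(Seq(
          StructField("col1", StringType, nullable = true),
          StructField("col2", StringType, nullable = true)
        ))

        val df = spark.createDataFrame(rdd, schema)
        df.show()
      }
    }

    The explicit schema supplies the same information an Encoder would: how to map the untyped Row fields onto typed columns.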