Tags: scala, apache-spark, apache-spark-sql, row, rdd

toDF() not handling RDD


I have an RDD of Rows called RowRDD. I am simply trying to convert it into a DataFrame. From the examples I have seen in various places on the internet, I should be able to call RowRDD.toDF(), but I am getting the error:

value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
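
Roughly what I am running (sc and sqlContext are in scope, and the sample rows below are only placeholders):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import sqlContext.implicits._
    
    // Placeholder rows; the real RowRDD comes from earlier processing
    val RowRDD: RDD[Row] = sc.parallelize(Seq(Row("a", "b"), Row("c", "d")))
    
    val df = RowRDD.toDF()   // fails to compile with the error above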


Solution

  • It doesn't work because Row is not a Product type, and createDataFrame with a single RDD argument is defined only for RDD[A] where A <: Product (a Product-based contrast is sketched after the code below).

    If you want to use RDD[Row] you have to provide a schema as the second argument. If you think about it, this should be obvious: Row is just a container of Any, and as such it doesn't provide enough information for schema inference.

    Assuming this is the same RDD as defined in your previous question, the schema is easy to generate:

    import org.apache.spark.sql.types._
    import org.apache.spark.sql.Row
    import org.apache.spark.rdd.RDD
    
    val rowRdd: RDD[Row] = ???
    
    // One non-nullable StringType column per field of the first Row,
    // named _1, _2, ... to mimic the default toDF column names
    val schema = StructType(
      (1 to rowRdd.first.size).map(i => StructField(s"_$i", StringType, false))
    )
    
    val df = sqlContext.createDataFrame(rowRdd, schema)
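
    For contrast, here is a minimal sketch (assuming sc and sqlContext are in scope; Record is a hypothetical case class) of why toDF() does compile once the element type is a Product:

    // Case classes extend Product, so Spark can infer the schema
    // from the constructor field names and types
    case class Record(key: String, value: String)
    
    import sqlContext.implicits._
    
    val recordRdd = sc.parallelize(Seq(Record("a", "1"), Record("b", "2")))
    
    // Compiles: the implicit toDF conversion exists for RDD[A] where A <: Product
    val recordDf = recordRdd.toDF()
    recordDf.printSchema()  // columns "key" and "value", both StringType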