Tags: scala, apache-spark, apache-spark-sql, apache-spark-dataset

Spark: How can a DataFrame be a Dataset[Row] if DataFrames have a schema?


This article claims that a DataFrame in Spark is equivalent to a Dataset[Row], but this blog post shows that a DataFrame has a schema.

Take the blog post's example of converting an RDD to a DataFrame: if a DataFrame were the same thing as a Dataset[Row], then converting an RDD to a DataFrame should be as simple as:

val rddToDF = rdd.map(value => Row(value))

But instead, the post shows that it takes this:

val rddStringToRowRDD = rdd.map(value => Row(value))
val dfschema = StructType(Array(StructField("value",StringType)))
val rddToDF = sparkSession.createDataFrame(rddStringToRowRDD,dfschema)
val rDDToDataSet = rddToDF.as[String]

Clearly a DataFrame is actually a Dataset of Rows plus a schema.


Solution

  • In Spark 2.0, the source code contains: type DataFrame = Dataset[Row]

    So a DataFrame is a Dataset[Row] simply by definition.

    A Dataset also has a schema; you can print it with printSchema(). Normally Spark infers the schema, so you don't have to write it yourself, but it's still there ;)
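
    As a minimal sketch of that point, the snippet below builds a typed Dataset[String] and its untyped DataFrame view and compares their schemas. The local session settings and the value names (ds, df, sameSchema) are made up for illustration; it is meant to be pasted into spark-shell or run as a script, not to be read as the only way to do this.

    ```scala
    import org.apache.spark.sql.{Dataset, Row, SparkSession}

    // Local session for illustration only; master and app name are arbitrary choices.
    val spark = SparkSession.builder().master("local[*]").appName("schema-demo").getOrCreate()
    import spark.implicits._

    // A typed Dataset[String]: Spark derives a one-column schema from the String encoder.
    val ds: Dataset[String] = Seq("a", "b").toDS()
    ds.printSchema()

    // toDF() returns a DataFrame; assigning it to Dataset[Row] compiles because of the type alias.
    val df: Dataset[Row] = ds.toDF()
    df.printSchema()

    // Both views of the data report the same schema.
    val sameSchema = ds.schema == df.schema
    ```

    Neither call needed a hand-written StructType; the schema came from the encoder in both cases.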

    You can also call createTempView(name) on a Dataset and use it in SQL queries, just as with DataFrames.
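
    A short sketch of registering a typed Dataset as a view and querying it with SQL follows. The view name "words" and the sample data are hypothetical, and createOrReplaceTempView is used instead of createTempView only so the snippet can be re-run without a "view already exists" error.

    ```scala
    import org.apache.spark.sql.SparkSession

    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("view-demo").getOrCreate()
    import spark.implicits._

    // Dataset[String]; its single column is named "value".
    val words = Seq("spark", "scala", "hive").toDS()

    // Register the Dataset under a (hypothetical) view name.
    words.createOrReplaceTempView("words")

    // spark.sql returns a DataFrame; .as[String] turns it back into a typed Dataset.
    val startsWithS = spark
      .sql("SELECT value FROM words WHERE value LIKE 's%'")
      .as[String]
      .collect()
      .sorted
    ```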

    In other words, a Dataset = a DataFrame from Spark 1.5 + an encoder that converts rows to your classes. After the types were merged in Spark 2.0, DataFrame became just an alias for Dataset[Row], i.e. a Dataset without a type-specific encoder.
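
    To make the role of the encoder concrete, here is a hedged sketch that takes an untyped DataFrame and attaches an encoder for a case class with as[T]. The Person class and the sample rows are invented for this example. In a compiled application the case class should be defined at the top level (not inside a method) so that Spark can derive its encoder.

    ```scala
    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    // Hypothetical domain class, used only for this illustration.
    case class Person(name: String, age: Int)

    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("encoder-demo").getOrCreate()
    import spark.implicits._

    // Untyped: each element is a generic Row, so fields are looked up by name at runtime.
    val df: DataFrame = Seq(("Ann", 31), ("Bob", 25)).toDF("name", "age")
    val rowNames = df.collect().map(row => row.getAs[String]("name")).toSet

    // Typed: as[Person] attaches the Person encoder; the column names must match the fields.
    val people: Dataset[Person] = df.as[Person]
    val typedNames = people.map(_.name).collect().toSet
    ```

    The data and the schema are the same in both values; only the static element type, and therefore the encoder used to materialize objects, differs.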

    About conversions: rdd.map() also returns an RDD; it never returns a DataFrame. You can do:

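    // Brings the implicit encoders (and toDF) into scope
    import sparkSession.implicits._

    // Dataset[Row] = DataFrame, without a type-specific encoder.
    // Note: createDataFrame(rdd) only accepts an RDD of case classes or tuples,
    // so for an RDD[String] use toDF instead.
    val rddToDF = rdd.toDF("value")
    // Now Spark is told to use the encoder for String, so it becomes a Dataset[String]
    val rDDToDataSet = rddToDF.as[String]

    // However, this can be shortened to:
    val dataset = sparkSession.createDataset(rdd)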
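
    To tie this back to the question: a Row on its own carries no column names or types, which is why the RDD[Row] route needs an explicit StructType, while the encoder route derives the schema for you. The sketch below, using made-up sample data and value names, builds the same data both ways and compares the resulting column names and types.

    ```scala
    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("rdd-demo").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq("a", "b", "c"))

    // Explicit route: RDD[Row] plus a hand-written schema.
    val schema = StructType(Array(StructField("value", StringType)))
    val explicitDF = spark.createDataFrame(rdd.map(value => Row(value)), schema)

    // Implicit route: the String encoder supplies the schema.
    val inferredDS = spark.createDataset(rdd)

    // Compare column names and data types of the two results.
    val explicitCols = explicitDF.schema.map(f => (f.name, f.dataType))
    val inferredCols = inferredDS.schema.map(f => (f.name, f.dataType))
    val sameColumns = explicitCols == inferredCols
    ```

    Either way the result has a schema; the only difference is whether you wrote it or an encoder produced it.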