Tags: scala, apache-spark, apache-spark-sql, apache-spark-dataset

Scala Spark Dataset change class type


I created a DataFrame that follows the schema of MyData1, then added a column so that the new DataFrame follows the schema of MyData2. Now I want to return the new DataFrame as a Dataset[MyData2], but I get the following error:

[info]   org.apache.spark.sql.AnalysisException: cannot resolve '`hashed`' given input columns: [id, description];
[info]   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:110)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)

Here is my code:

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lit

case class MyData1(id: String, description: String)


case class MyData2(id: String, description: String, hashed: String) 

object MyObject {

    def read(arg1: String, arg2: String): Dataset[MyData2] = {
        var df: DataFrame = null
        val obj1 = new Matcher("cbutrer383", "e8f8chsdfd")
        val obj2 = new Matcher("cbutrer383", "g567g4rwew")
        val obj3 = new Matcher("cbutrer383", "567yr45e45")
        df = Seq(obj1, obj2, obj3).toDF("id", "description")

        df.withColumn("hashed", lit("hash"))

        val ds: Dataset[MyData2] = df.as[MyData2]
        ds
    }
}

I suspect the problem is in the following line, but I can't figure out what it is:

val ds: Dataset[MyData2] = df.as[MyData2]

I'm new to Spark, so this is probably a basic mistake. Can anyone help? Thanks in advance.


Solution

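  • You forgot to assign the DataFrame returned by withColumn back to df. DataFrames are immutable, so withColumn leaves df unchanged; df still has only the columns id and description when df.as[MyData2] is analyzed, which is why hashed cannot be resolved. Reassign the result: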

    df = df.withColumn("hashed", lit("hash"))
    

    The Spark documentation for withColumn says:

    Returns a new Dataset by adding a column or replacing the existing column that has the same name.
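
    Because every transformation returns a new Dataset, the var and the reassignment can be avoided by chaining the calls. The following is only a sketch under a few assumptions that are not in the question: the undefined Matcher class is replaced by MyData1, read takes a SparkSession instead of the unused arg1 and arg2, and the local main method and application name exist purely for illustration.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.lit

case class MyData1(id: String, description: String)
case class MyData2(id: String, description: String, hashed: String)

object MyObject {

  // Assumption: the SparkSession is passed in; the question does not show where it comes from.
  def read(spark: SparkSession): Dataset[MyData2] = {
    // Brings toDS and the implicit Encoders required by .as[MyData2] into scope.
    import spark.implicits._

    // Assumption: MyData1 stands in for the question's undefined Matcher class.
    val input: Dataset[MyData1] = Seq(
      MyData1("cbutrer383", "e8f8chsdfd"),
      MyData1("cbutrer383", "g567g4rwew"),
      MyData1("cbutrer383", "567yr45e45")
    ).toDS()

    // withColumn returns a new DataFrame; chaining uses that result directly,
    // so the hashed column exists by the time .as[MyData2] is resolved.
    input
      .withColumn("hashed", lit("hash"))
      .as[MyData2]
  }

  // Hypothetical entry point for trying the sketch locally.
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-type-change-sketch")
      .getOrCreate()
    try read(spark).show()
    finally spark.stop()
  }
}
```

    Keeping everything in vals also means a forgotten reassignment cannot silently discard the new column.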