
Benefit of using case class in Spark DataFrame


What is the advantage of using a case class in a Spark DataFrame? I can define the schema using the "inferSchema" option or by defining StructType fields. I referred to "https://docs.scala-lang.org/tour/case-classes.html" but could not understand what the advantages of using a case class are, apart from generating the schema via reflection.
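
For concreteness, here is a minimal sketch of the three approaches I mean (the file name, columns and the pre-existing spark session are made up, as in spark-shell):

    import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    import spark.implicits._

    // 1. Let Spark infer the schema by scanning the data
    val inferred = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")

    // 2. Spell the schema out as StructType fields
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val explicit = spark.read.option("header", "true").schema(schema).csv("people.csv")

    // 3. Derive the schema from a case class via reflection and get a typed Dataset
    case class Person(name: String, age: Int)
    val typed = spark.read.option("header", "true").schema(Encoders.product[Person].schema).csv("people.csv").as[Person]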


Solution

  • inferSchema can be an expensive operation (it requires an extra pass over the data) and it defers error detection unnecessarily. Consider the following pseudocode:

    val df = loadDFWithSchemaInference
    //doing things that take time
    df.map(row => row.getAs[String]("fieldName")) //more stuff
    

    Now, in this code you already have the assumption baked in that fieldName is of type String, but it is only expressed and checked late in your processing, leading to unfortunate runtime errors if it wasn't actually a String.
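
    A fleshed-out version of that pseudocode (a sketch; the file and column names are made up, and a spark session is assumed as in spark-shell) shows where the failure actually surfaces:

    import spark.implicits._

    // extra pass over the data just to guess the column types
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("input.csv")

    // ...expensive transformations happen here...

    // Compiles and builds the plan just fine, but if "fieldName" happened to contain
    // only digits and was inferred as an integer column, this blows up with a
    // ClassCastException only when the action finally runs.
    val names = df.map(row => row.getAs[String]("fieldName")).collect()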

    Now, if you did this instead:

    val df = load.as[CaseClass]
    

    or

    val df = load.schema(predefinedSchema)
    

    the fact that fieldName is a String becomes a precondition, and thus your code is more robust and less error-prone.
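
    A sketch of what those two alternatives look like in full (the class, field and file names are made up):

    import org.apache.spark.sql.Encoders
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
    import spark.implicits._

    case class Record(fieldName: String, count: Long)

    // Case class: the schema comes from reflection and the result is a typed Dataset,
    // so downstream code reads fieldName as a plain String field, checked at compile time.
    val ds = spark.read
      .option("header", "true")
      .schema(Encoders.product[Record].schema)
      .csv("input.csv")
      .as[Record]
    ds.map(_.fieldName.toUpperCase) // no stringly-typed getAs, no late ClassCastException

    // Predefined StructType: type mismatches surface at load time as nulls or parse errors
    // (depending on the reader's "mode" option), not deep inside the job.
    val withSchema = spark.read
      .option("header", "true")
      .schema(StructType(Seq(
        StructField("fieldName", StringType, nullable = true),
        StructField("count", LongType, nullable = true))))
      .csv("input.csv")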

    Schema inference is very handy for exploratory work in the REPL or e.g. Zeppelin, but it should not be used in operational code.

    Edit / Addendum: I personally prefer case classes over schemas because I prefer the Dataset API to the DataFrame API (which is just Dataset[Row]) for similar robustness reasons.
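
    A small illustration of that difference (with a made-up Person type; a spark session is assumed):

    import spark.implicits._

    case class Person(name: String, age: Int)

    val people = Seq(Person("Ada", 36), Person("Linus", 28)).toDS() // Dataset[Person]
    val rows = people.toDF()                                        // DataFrame, i.e. Dataset[Row]

    // Dataset API: field names and types are checked by the compiler.
    people.filter(_.age > 30).map(_.name)

    // DataFrame API: the same logic is stringly typed; a wrong column name
    // or type only fails at runtime.
    rows.filter($"age" > 30).select($"name")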