What is the advantage of using a case class in a Spark DataFrame? I can define the schema using the "inferSchema" option or by defining StructType fields. I referred to "https://docs.scala-lang.org/tour/case-classes.html" but could not understand what the advantages of using a case class are, apart from generating the schema via reflection.
inferSchema can be an expensive operation and defers error behavior unnecessarily. Consider the following pseudocode:
val df = loadDFWithSchemaInference // e.g. spark.read.option("inferSchema", "true").csv(...)
// ...doing things that take time...
df.map(row => row.getAs[String]("fieldName")) // ...more stuff
Now in this code you already have the assumption baked in that fieldName is of type String, but it's only expressed and enforced late in your processing, leading to unfortunate errors in case it wasn't actually a String.
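To make the deferred error concrete, here is a minimal sketch (the file events.csv, the column fieldName, and the session setup are assumptions for illustration, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")            // local session just for the sketch
  .appName("inference-demo")
  .getOrCreate()
import spark.implicits._

// suppose "fieldName" happens to contain only digits, so inference picks IntegerType
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // extra pass over the data just to guess types
  .csv("events.csv")

// this compiles and, because Spark is lazy, even "runs" without complaint...
val values = df.map(row => row.getAs[String]("fieldName"))

// ...the ClassCastException only surfaces here, possibly long into the job
values.show()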
Now if you did this instead:
val ds = load.as[CaseClass] // yields a Dataset[CaseClass]; needs import spark.implicits._ for the Encoder
or
val df = spark.read.schema(predefinedSchema).csv(path) // explicit StructType, no inference
the fact that fieldName is a String will be a precondition, and thus your code will be more robust and less error-prone.
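For comparison, here is a minimal sketch of both ways to fix the schema up front (the file people.csv, the Person class, and its fields are hypothetical); the case class also gives you the reflection-derived schema the question mentions, via Encoders.product:

import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.types._

case class Person(name: String, age: Int)

val spark = SparkSession.builder.master("local[*]").appName("schema-demo").getOrCreate()
import spark.implicits._

// Option 1: hand-written StructType
val predefinedSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))
val df = spark.read.option("header", "true").schema(predefinedSchema).csv("people.csv")

// Option 2: derive the equivalent schema from the case class via reflection...
val reflectedSchema = Encoders.product[Person].schema

// ...and read straight into a typed Dataset, so field names and types are compiler-checked
val people = spark.read.option("header", "true").schema(reflectedSchema).csv("people.csv").as[Person]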
Schema inference is very handy if you do exploratory things in the REPL or e.g. Zeppelin, but it should not be used in operational code.
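In an exploratory spark-shell or Zeppelin session (where a spark session already exists), that might look like this, with data.csv as a placeholder path:

// quick look at an unknown file; fine interactively, not in production jobs
val sample = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data.csv")

sample.printSchema()
sample.show(5)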
Edit Addendum:
I personally prefer to use case classes over schemas because I prefer the Dataset API to the DataFrame API (which is Dataset[Row]) for similar robustness reasons.
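Sketched with the hypothetical Person/people from the example above, the difference that matters to me is that a typo in a Dataset transformation fails at compile time, whereas the DataFrame version only fails at runtime when the query is analysed:

// Dataset API: the compiler knows the fields and their types
val adultNames = people.filter(_.age >= 18).map(_.name)    // _.agee would not compile

// DataFrame API: columns are addressed by string, checked only at analysis time
val adultNamesDf = df.filter($"age" >= 18).select($"name") // $"agee" fails only at runtime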