How to select columns that exist in case classes from DataFrame


Given a Spark DataFrame with columns "id", "first", "last", and "year":

val df = sc.parallelize(Seq(
  (1, "John", "Doe", 1986),
  (2, "Ive", "Fish", 1990),
  (4, "John", "Wayne", 1995)
)).toDF("id", "first", "last", "year")

and case class

case class IdAndLastName(id: Int, last: String)

I would like to select only the columns that exist in the case class, namely id and last. In other words, I want the output of df.select("id", "last"), but driven by the case class so that the attribute names are not hardcoded. How can I achieve this in a compact way?


Solution

  • You can explicitly create an encoder for the case class (this usually happens implicitly). You can then read the field names off the encoder's schema and use them in the select statement:

    import org.apache.spark.sql.Encoders

    val fieldNames = Encoders.product[IdAndLastName].schema.fieldNames
    df.select(fieldNames.head, fieldNames.tail: _*).show()
    

    Output:

    +---+-----+
    | id| last|
    +---+-----+
    |  1|  Doe|
    |  2| Fish|
    |  4|Wayne|
    +---+-----+
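
  • As a variation (not part of the original answer), the same idea can be wrapped in a small generic helper so it works for any case class. This is only a sketch, and selectAs is a hypothetical name:

    import scala.reflect.runtime.universe.TypeTag
    import org.apache.spark.sql.{DataFrame, Encoders}
    import org.apache.spark.sql.functions.col

    // Hypothetical helper: keep only the columns that appear in case class T.
    // Encoders.product derives the schema from T's constructor fields.
    def selectAs[T <: Product : TypeTag](df: DataFrame): DataFrame = {
      val cols = Encoders.product[T].schema.fieldNames.map(col)
      df.select(cols: _*)
    }

    selectAs[IdAndLastName](df).show()  // same output as above

    Mapping the field names to Column objects via col also avoids the head/tail split required by the String-based select overload.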