Tags: apache-spark, apache-spark-dataset

How to efficiently rename columns in Datasets (Spark 2.0)


With DataFrames, one can simply rename a column using df.withColumnRenamed("oldName", "newName"). With Datasets, since every field is typed and named, this doesn't seem possible. The only workaround I can think of is to use map on the Dataset:

case class Orig(a: Int, b: Int)
case class OrigRenamed(a: Int, bNewName: Int)

val origDS = Seq(Orig(1,2), Orig(3,4)).toDS
origDS.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+

// To rename with map
val origRenamedDS = origDS.map { case Orig(x, y) => OrigRenamed(x, y) }
origRenamedDS.show
+---+--------+
|  a|bNewName|
+---+--------+
|  1|       2|
|  3|       4|
+---+--------+

This seems a very round-about and inefficient way just to rename a column. Is there a better way?


Solution

  • A slightly more concise solution would be something like this:

    origDS.toDF("a", "bNewName").as[OrigRenamed]
    

    but in practice renaming is simply not meaningful on a statically typed Dataset. While a Dataset uses the same columnar representation as a DataFrame (a DataFrame is just a Dataset[Row]), the semantics are completely different here.

    The name of a column corresponds to a specific field of the stored objects, so it is not something that can be dynamically renamed. In other words, Datasets are not statically typed DataFrames but collections of objects, as the sketch below illustrates.
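
    For completeness, here is a minimal, self-contained sketch of the toDF(...).as[T] approach. The SparkSession setup, the RenameExample object name, and the local[*] master are assumptions made so it can run standalone; the case classes mirror the ones in the question:

        import org.apache.spark.sql.SparkSession

        // Case classes live at top level so Spark can derive encoders for them.
        case class Orig(a: Int, b: Int)
        case class OrigRenamed(a: Int, bNewName: Int)

        object RenameExample {
          def main(args: Array[String]): Unit = {
            val spark = SparkSession.builder()
              .master("local[*]")              // assumption: local run for demonstration
              .appName("rename-dataset-column")
              .getOrCreate()
            import spark.implicits._

            val origDS = Seq(Orig(1, 2), Orig(3, 4)).toDS()

            // toDF replaces all column names positionally; as[OrigRenamed]
            // then matches the new names against the target case class fields.
            val renamedDS = origDS.toDF("a", "bNewName").as[OrigRenamed]

            renamedDS.printSchema()  // a: integer, bNewName: integer
            renamedDS.show()

            spark.stop()
          }
        }

    Note that toDF renames every column by position, so all names must be listed even when only one changes, and the as[OrigRenamed] step throws an AnalysisException if the new names do not line up with the case class fields.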