With DataFrames, one can simply rename columns by using df.withColumnRenamed("oldName", "newName"). In Datasets, since every field is typed and named, this doesn't seem possible. The only workaround I can think of is to use map on the Dataset:
case class Orig(a: Int, b: Int)
case class OrigRenamed(a: Int, bNewName: Int)
val origDS = Seq(Orig(1,2), Orig(3,4)).toDS
origDS.show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
// To rename with map
val origRenamedDS = origDS.map{ case Orig(x,y) => OrigRenamed(x,y) }
origRenamedDS.show
+---+--------+
|  a|bNewName|
+---+--------+
|  1|       2|
|  3|       4|
+---+--------+
This seems like a very roundabout and inefficient way just to rename a column. Is there a better way?
A slightly more concise solution would be something like this:
origDS.toDF("a", "bNewName").as[OrigRenamed]
but in practice renaming is simply not meaningful on a statically typed Dataset. While we use the same columnar representation as a DataFrame (Dataset[Row]), the semantics are completely different here.
The name of a column corresponds to a specific field of the stored objects, so it is not something that can be dynamically renamed. In other words, Datasets are not statically typed DataFrames but collections of objects.