Assume you have
val df = Seq(("Jack", 91, 86), ("Mike", 79, 85), ("Julia", 93, 70)).toDF("Name", "Maths", "Art")
which gives:
+-----+-----+---+
| Name|Maths|Art|
+-----+-----+---+
| Jack| 91| 86|
| Mike| 79| 85|
|Julia| 93| 70|
+-----+-----+---+
Now you want to unpivot it by:
df.select($"Name", expr("stack(2, 'Maths', Maths, 'Art', Art) as (Subject, Score)"))
which gives:
+-----+-------+-----+
| Name|Subject|Score|
+-----+-------+-----+
| Jack| Maths| 91|
| Jack| Art| 86|
| Mike| Maths| 79|
| Mike| Art| 85|
|Julia| Maths| 93|
|Julia| Art| 70|
+-----+-------+-----+
So far so good! But what if you don't know the list of column names in advance? What if the list is long, or changes over time? How can we avoid hardcoding the column names like that?
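One way is to build the stack() expression string from the column list at runtime. A sketch (stackExpr is a hypothetical helper, not a Spark API; note that stack requires all melted columns to share a compatible type):

```scala
// Build "stack(2, 'Maths', `Maths`, 'Art', `Art`) as (Subject, Score)"
// from a column list instead of hardcoding it.
// Backticks guard column names containing spaces or dots.
def stackExpr(toMelt: Seq[String], variableCol: String, valueCol: String): String =
  s"stack(${toMelt.size}, " +
    toMelt.map(c => s"'$c', `$c`").mkString(", ") +
    s") as ($variableCol, $valueCol)"

// Then, in Spark:
// val toMelt = df.columns.filterNot(_ == "Name")
// df.select($"Name", expr(stackExpr(toMelt, "Subject", "Score")))
```

This keeps the query identical to the hardcoded version while deriving the column list from df.columns.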
Or even something like this is also good:
// hypothetical API, does not exist as written
df.select($"Name", unpivot(df.columns.diff(Seq("Name"))) as ("Subject", "Score"))
Why don't we have an API like this?
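For what it's worth, newer Spark versions do ship such an API: Spark 3.4 added Dataset.unpivot (with melt as an alias). A sketch, assuming Spark 3.4+:

```scala
// Spark 3.4+ only: built-in unpivot (alias: melt).
// With this overload, every column not listed in `ids` is unpivoted.
val longDf = df.unpivot(
  ids = Array($"Name"),
  variableColumnName = "Subject",
  valueColumnName = "Score"
)
```

On older versions you still need one of the workarounds below.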
This works quite well indeed:
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
import org.apache.spark.sql.DataFrame

def melt(preserves: Seq[String],
         toMelt: Seq[String],
         column: String = "variable",
         row: String = "value",
         df: DataFrame): DataFrame = {
  // One struct per melted column: (column name, column value).
  val varsAndVals = array(toMelt.map(c => struct(lit(c).alias(column), col(c).alias(row))): _*)
  // explode() produces one row per melted column.
  val tmp = df.withColumn("_vars_and_vals", explode(varsAndVals))
  // Keep the preserved columns, then pull name and value out of the struct.
  val cols = preserves.map(col) ++
    Seq(column, row).map(x => col("_vars_and_vals")(x).alias(x))
  tmp.select(cols: _*)
}
Source: How to melt Spark DataFrame? Thanks to @user10938362
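Wiring that melt into the example above, with the column list derived from df.columns rather than hardcoded:

```scala
// Melt every column except "Name", computing the list at runtime.
val melted = melt(
  preserves = Seq("Name"),
  toMelt = df.columns.toSeq.diff(Seq("Name")),
  column = "Subject",
  row = "Score",
  df = df
)
melted.show()
```

This produces the same six-row (Name, Subject, Score) output as the stack() version, without naming Maths or Art anywhere.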