Search code examples
dataframescalaapache-spark-sqlpivotunpivot

Scala Unpivot Table


SCALA

I have a table with this struct:

FName SName Email Jan 2021 Feb 2021 Mar 2021 Total 2021
Micheal Scott [email protected] 4000 5000 3400 50660
Dwight Schrute [email protected] 1200 6900 1000 35000
Kevin Malone [email protected] 9000 6000 18000 32000

And i want to transform it to: enter image description here

I tried with 'stack' method but i couldn't get it to work. Thanks


Solution

  • You can flatten the monthly/total columns via explode as shown below:

    val df = Seq(
      ("Micheal", "Scott", "[email protected]", 4000, 5000, 3400, 50660),
      ("Dwight", "Schrute", "[email protected]", 1200, 6900, 1000, 35000),
      ("Kevin", "Malone", "[email protected]", 9000, 6000, 18000, 32000)
    ).toDF("FName","SName", "Email", "Jan 2021", "Feb 2021", "Mar 2021", "Total 2021")
    
    val moYrCols = Array("Jan 2021", "Feb 2021", "Mar 2021", "Total 2021")  // (**)
    val otherCols = df.columns diff moYrCols
    val structCols = moYrCols.map{ c =>
        val moYr = split(lit(c), "\\s+")
        struct(moYr(1).as("Year"), moYr(0).as("Month"), col(c).as("Value"))
      }
    
    df.
      withColumn("flattened", explode(array(structCols: _*))).
      select(otherCols.map(col) :+ $"flattened.*": _*).
      show
    /*
    +-------+-------+------------------+----+-----+-----+
    |  FName|  SName|             Email|Year|Month|Value|
    +-------+-------+------------------+----+-----+-----+
    |Micheal|  Scott| [email protected]|2021|  Jan| 4000|
    |Micheal|  Scott| [email protected]|2021|  Feb| 5000|
    |Micheal|  Scott| [email protected]|2021|  Mar| 3400|
    |Micheal|  Scott| [email protected]|2021|Total|50660|
    | Dwight|Schrute|[email protected]|2021|  Jan| 1200|
    | Dwight|Schrute|[email protected]|2021|  Feb| 6900|
    | Dwight|Schrute|[email protected]|2021|  Mar| 1000|
    | Dwight|Schrute|[email protected]|2021|Total|35000|
    |  Kevin| Malone| [email protected]|2021|  Jan| 9000|
    |  Kevin| Malone| [email protected]|2021|  Feb| 6000|
    |  Kevin| Malone| [email protected]|2021|  Mar|18000|
    |  Kevin| Malone| [email protected]|2021|Total|32000|
    +-------+-------+------------------+----+-----+-----+
    */
    

    (**) Use pattern matching in case there are many columns; for example:

    val moYrCols = df.columns.filter(_.matches("[A-Za-z]+\\s+\\d{4}"))