Search code examples
scalaapache-sparkapache-spark-sqlapache-spark-1.5

How to transpose dataframe in Spark 1.5 (no pivot operator available)?


I want to transpose following table using spark scala without Pivot function

I am using Spark 1.5.1 and Pivot function does not support in 1.5.1. Please suggest suitable method to transpose following table:

Customer Day   Sales
1        Mon    12
1        Tue    10
1        Thu    15
1        Fri     2
2        Sun    10
2        Wed     5
2        Thu     4
2        Fri     3

Output table :

Customer Sun Mon Tue Wed Thu Fri
   1      0   12  10   0  15  2
   2     10    0   0   5  4   3

Following code is not working as I am using Spark 1.5.1 and pivot function is available from Spark 1.6:

    var Trans = Cust_Sales.groupBy("Customer").Pivot("Day").sum("Sales")

Solution

  • Not sure how efficient that is, but you can use collect to get all the distinct days, and then add these columns, then use groupBy and sum:

    // get distinct days from data (this assumes there are not too many of them):
    val days: Array[String] = df.select("Day")
        .distinct()
        .collect()
        .map(_.getAs[String]("Day"))
    
    // add column for each day with the Sale value if days match:
    val withDayColumns = days.foldLeft(df) { 
        case (data, day) => data.selectExpr("*", s"IF(Day = '$day', Sales, 0) AS $day")
    }
    
    // wrap it up 
    val result = withDayColumns
       .drop("Day")
       .drop("Sales")
       .groupBy("Customer")
       .sum(days: _*)
    
    result.show()
    

    Which prints (almost) what you wanted:

    +--------+--------+--------+--------+--------+--------+--------+
    |Customer|sum(Tue)|sum(Thu)|sum(Sun)|sum(Fri)|sum(Mon)|sum(Wed)|
    +--------+--------+--------+--------+--------+--------+--------+
    |       1|      10|      15|       0|       2|      12|       0|
    |       2|       0|       4|      10|       3|       0|       5|
    +--------+--------+--------+--------+--------+--------+--------+
    

    I'll leave it to you to rename / reorder the columns if needed.