
Spark table manipulation - column values to rows and row values transposed


I have the following dataset:

[image: input table with columns cust_id, 100x, 200x, 300x]

I want to convert it to the following using Spark. Any pointers would be helpful.

[image: desired output, one (key, value) row per column for each cust_id]


Solution

  • On Spark 2.4.3 you can use map_from_arrays; it is a straightforward built-in function.

    scala> import org.apache.spark.sql.functions._

    scala> val df = Seq((1,40,60,10), (2,34,10,20), (3,87,29,62)).toDF("cust_id","100x","200x","300x")

    scala> df.show
    +-------+----+----+----+
    |cust_id|100x|200x|300x|
    +-------+----+----+----+
    |      1|  40|  60|  10|
    |      2|  34|  10|  20|
    |      3|  87|  29|  62|
    +-------+----+----+----+
    

    Apply map_from_arrays and explode; this gives the desired result:

    df.select(array('*).as("v"), lit(df.columns).as("k"))
      .select('v.getItem(0).as("cust_id"), map_from_arrays('k, 'v).as("map"))
      .select('cust_id, explode('map))
      .show(false)
    
    +-------+-------+-----+
    |cust_id|key    |value|
    +-------+-------+-----+
    |1      |cust_id|1    |
    |1      |100x   |40   |
    |1      |200x   |60   |
    |1      |300x   |10   |
    |2      |cust_id|2    |
    |2      |100x   |34   |
    |2      |200x   |10   |
    |2      |300x   |20   |
    |3      |cust_id|3    |
    |3      |100x   |87   |
    |3      |200x   |29   |
    |3      |300x   |62   |
    +-------+-------+-----+
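
    Note that array('*) puts cust_id itself into the map, which is why a
    cust_id row shows up in the output above. If you only want the value
    columns, a minimal variant (same df and imports as above; valueCols is
    just a helper name introduced here) would be:

    val valueCols = df.columns.filter(_ != "cust_id")

    df.select('cust_id,
        map_from_arrays(lit(valueCols), array(valueCols.map(col): _*)).as("map"))
      .select('cust_id, explode('map))
      .show(false)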
    

    I think the built-in functions will perform better than a UDF, since Catalyst can optimize them and no extra serialization is involved.
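
    If you'd rather skip the intermediate map entirely, the built-in stack
    generator unpivots in a single expression. A sketch against the same df
    (the backticks are required because the column names start with a digit):

    df.selectExpr(
        "cust_id",
        "stack(3, '100x', `100x`, '200x', `200x`, '300x', `300x`) as (key, value)"
      ).show(false)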