
Spark table manipulation - column values to rows and row values transposed


I have the following dataset:

[image: input table with columns cust_id, 100x, 200x, 300x]

I want to convert it to the following using Spark. Any pointers would be helpful.

[image: desired output, one (key, value) row per column for each cust_id]


Solution

  • On Spark 2.4.3 you can use map_from_arrays; it is a straightforward built-in function.

    scala> import org.apache.spark.sql.functions._

    scala> val df = Seq((1,40,60,10), (2,34,10,20), (3,87,29,62)).toDF("cust_id","100x","200x","300x")

    scala> df.show
    +-------+----+----+----+
    |cust_id|100x|200x|300x|
    +-------+----+----+----+
    |      1|  40|  60|  10|
    |      2|  34|  10|  20|
    |      3|  87|  29|  62|
    +-------+----+----+----+
    

    Apply map_from_arrays and explode; this gives the desired result:

    df.select(array('*).as("v"), lit(df.columns).as("k"))
      .select('v.getItem(0).as("cust_id"), map_from_arrays('k, 'v).as("map"))
      .select('cust_id, explode('map))
      .show(false)
    
    +-------+-------+-----+
    |cust_id|key    |value|
    +-------+-------+-----+
    |1      |cust_id|1    |
    |1      |100x   |40   |
    |1      |200x   |60   |
    |1      |300x   |10   |
    |2      |cust_id|2    |
    |2      |100x   |34   |
    |2      |200x   |10   |
    |2      |300x   |20   |
    |3      |cust_id|3    |
    |3      |100x   |87   |
    |3      |200x   |29   |
    |3      |300x   |62   |
    +-------+-------+-----+
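
    Note that array('*) puts cust_id itself into the map, which is why a
    cust_id row shows up in the output above. If you only want the value
    columns, a minimal variant (same df and imports as above; valueCols is
    just a helper name introduced here) would be:

    val valueCols = df.columns.filter(_ != "cust_id")

    df.select('cust_id,
        map_from_arrays(lit(valueCols), array(valueCols.map(col): _*)).as("map"))
      .select('cust_id, explode('map))
      .show(false)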
    

    I think the built-in functions will perform better than a UDF, since Catalyst can optimize them and no extra serialization is involved.
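
    If you'd rather skip the intermediate map entirely, the built-in stack
    generator unpivots in a single expression. A sketch against the same df
    (the backticks are required because the column names start with a digit):

    df.selectExpr(
        "cust_id",
        "stack(3, '100x', `100x`, '200x', `200x`, '300x', `300x`) as (key, value)"
      ).show(false)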