I have the following dataset, and I want to convert it to the format below using Spark. Any pointers would be helpful.
With Spark 2.4.3 you can use map_from_arrays; it is a straightforward built-in function.
scala> val df = Seq((1,40,60,10), (2,34,10,20), (3,87,29,62) ).toDF("cust_id","100x","200x","300x")
scala> df.show
+-------+----+----+----+
|cust_id|100x|200x|300x|
+-------+----+----+----+
| 1| 40| 60| 10|
| 2| 34| 10| 20|
| 3| 87| 29| 62|
+-------+----+----+----+
Apply map_from_arrays and explode, and it will give your desired result:
df.select(array('*).as("v"), lit(df.columns).as("k"))
  .select('v.getItem(0).as("cust_id"), map_from_arrays('k, 'v).as("map"))
  .select('cust_id, explode('map))
  .show(false)
+-------+-------+-----+
|cust_id|key |value|
+-------+-------+-----+
|1 |cust_id|1 |
|1 |100x |40 |
|1 |200x |60 |
|1 |300x |10 |
|2 |cust_id|2 |
|2 |100x |34 |
|2 |200x |10 |
|2 |300x |20 |
|3 |cust_id|3 |
|3 |100x |87 |
|3 |200x |29 |
|3 |300x |62 |
+-------+-------+-----+
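To see what map_from_arrays + explode is doing here, the same wide-to-long ("melt") transform can be sketched on plain Scala collections — a minimal illustration, not the Spark API itself:

```scala
// Column names act as the map keys; each row supplies the values.
val columns = Seq("cust_id", "100x", "200x", "300x")
val rows = Seq(Seq(1, 40, 60, 10), Seq(2, 34, 10, 20), Seq(3, 87, 29, 62))

// Zip every column name with its value, then emit one output row per
// (cust_id, key, value) triple -- mirroring map_from_arrays + explode.
val melted: Seq[(Int, String, Int)] = rows.flatMap { r =>
  val custId = r.head
  columns.zip(r).map { case (k, v) => (custId, k, v) }
}
```

Note that, as in the Spark output above, cust_id itself also appears as a key; if you don't want that row, filter it out (or build the map from the value columns only).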
I think a built-in function will give better performance than a UDF.