I am looking for ways how to transform few columns to rows. For example if my input is like below:
+-------+----+----------+--------+
|Country|Year|Population| GDP|
+-------+----+----------+--------+
| USA|2020| 331002651|21433226|
| China|2020|1439323776|16364160|
| India|2020|1380004385| 3140892|
+-------+----+----------+--------+
I wanted an output like:
+-------+----+----------+----------+
|Country|Year|Measure | Value|
+-------+----+----------+----------+
| USA|2020|GDP | 21433226|
| China|2020|GDP | 16364160|
| India|2020|GDP | 3140892|
| USA|2020|Population| 331002651|
| China|2020|Population|1439323776|
| India|2020|Population|1380004385|
+-------+----+----------+----------+
I am referring quite a few like: PySpark Dataframe melt columns into rows
But I am unable to translate to my requirement.
You can use stack to create multiple rows from a single row:
from pyspark.sql import functions as F
df.select('Country', 'Year',
F.expr('stack(2, "Population", Population, "GDP", GDP) as (Measure, value)')) \
.show()
Output:
+-------+----+----------+----------+
|Country|Year| Measure| value|
+-------+----+----------+----------+
| USA|2020|Population| 331002651|
| USA|2020| GDP| 21433226|
| China|2020|Population|1439323776|
| China|2020| GDP| 16364160|
| India|2020|Population|1380004385|
| India|2020| GDP| 3140892|
+-------+----+----------+----------+