Search code examples
apache-sparkpyspark

pyspark partial transpose of column to row


I am looking for ways how to transform few columns to rows. For example if my input is like below:

+-------+----+----------+--------+
|Country|Year|Population|     GDP|
+-------+----+----------+--------+
|    USA|2020| 331002651|21433226|
|  China|2020|1439323776|16364160|
|  India|2020|1380004385| 3140892|
+-------+----+----------+--------+

I wanted an output like:

+-------+----+----------+----------+
|Country|Year|Measure   |     Value|
+-------+----+----------+----------+
|    USA|2020|GDP       |  21433226|
|  China|2020|GDP       |  16364160|
|  India|2020|GDP       |   3140892|
|    USA|2020|Population| 331002651|
|  China|2020|Population|1439323776|
|  India|2020|Population|1380004385|
+-------+----+----------+----------+

I am referring quite a few like: PySpark Dataframe melt columns into rows

But I am unable to translate to my requirement.


Solution

  • You can use stack to create multiple rows from a single row:

    from pyspark.sql import functions as F
    
    df.select('Country', 'Year', 
              F.expr('stack(2, "Population", Population, "GDP", GDP) as (Measure, value)')) \
      .show()
    

    Output:

    +-------+----+----------+----------+
    |Country|Year|   Measure|     value|
    +-------+----+----------+----------+
    |    USA|2020|Population| 331002651|
    |    USA|2020|       GDP|  21433226|
    |  China|2020|Population|1439323776|
    |  China|2020|       GDP|  16364160|
    |  India|2020|Population|1380004385|
    |  India|2020|       GDP|   3140892|
    +-------+----+----------+----------+