
Efficiently transpose/explode spark dataframe columns into rows in a new table/dataframe format [pyspark]


How to efficiently explode a pyspark dataframe in this way:

+----+-------+------+------+
| id |sport  |travel| work |
+----+-------+------+------+
| 1  | 0.2   | 0.4  | 0.6  |
+----+-------+------+------+
| 2  | 0.7   | 0.9  | 0.5  |
+----+-------+------+------+

and my desired output is this:

+------+--------+  
| c_id | score  |  
+------+--------+  
| 1    | 0.2    |  
+------+--------+  
| 1    | 0.4    |  
+------+--------+  
| 1    | 0.6    |  
+------+--------+  
| 2    | 0.7    |  
+------+--------+  
| 2    | 0.9    |  
+------+--------+  
| 2    | 0.5    |  
+------+--------+  

Solution
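
For reference, a minimal sketch to recreate the example input (assuming a local SparkSession; column names and values are taken from the question):

    from pyspark.sql import SparkSession

    # minimal sketch: assumes a SparkSession is available locally
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 0.2, 0.4, 0.6), (2, 0.7, 0.9, 0.5)],
        ["id", "sport", "travel", "work"],
    )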

  • First, you could put the three score columns into an array, wrap it in arrays_zip, explode the result into rows, unpack the struct with .*, and finally select and rename the unzipped column.

    df.withColumn("zip", F.explode(F.arrays_zip(F.array("sport","travel","work"))))\
      .select("id", F.col("zip.*")).withColumnRenamed("0","score").show()
    
    +---+-----+
    | id|score|
    +---+-----+
    |  1|  0.2|
    |  1|  0.4|
    |  1|  0.6|
    |  2|  0.7|
    |  2|  0.9|
    |  2|  0.5|
    +---+-----+
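
    Note that the desired output calls the key column c_id rather than id; if you need that exact name, an extra .withColumnRenamed("id", "c_id") at the end (not shown in the snippet above) would produce it.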
    

    You can also do this without arrays_zip (as mentioned by cPak). arrays_zip is used to combine arrays from different dataframe columns into struct form so that you can explode them all together and then select with .*; since only a single output column is needed here, you could just use:

    df.withColumn("score", F.explode((F.array(*(x for x in df.columns if x!="id"))))).select("id","score").show()