Search code examples
apache-sparkpyspark

Explode column values into multiple columns in pyspark


I have the following pyspark dataframe.

+--------------------------------------------------------------------------------+
|substitutes                                                                     |
+--------------------------------------------------------------------------------+
|[[1, 30981733]]                                                                 |
|[[1, 598319049], [2, 38453298], [3, 2007569845]]                                |
|[[1, 10309216]]                                                                 |
|[[1, 730446111], [2, 617024811], [3, 665689309], [4, 883699488], [5, 159896736]]|
|[[1, 10290923], [2, 33282357]]                                                  |
|[[1, 102649381], [2, 10294853], [3, 10294854], [4, 44749181], [5, 35132896]]    |
|[[1, 10307642], [2, 10307636], [3, 15754215], [4, 45612359], [5, 10307635]]     |
|[[1, 43982130], [2, 15556050], [3, 15556051], [4, 11961012], [5, 16777263]]     |
|[[1, 849607426], [2, 185158834], [3, 11028011], [4, 10309801], [5, 11028010]]   |
|[[1, 21905160], [2, 21609422], [3, 21609417], [4, 20554612], [5, 20554601]]     |
+--------------------------------------------------------------------------------+

And I would like to explode the columns into multiple columns

| substitutes_1 | substitutes_2 | substitutes_3 | substitutes_4 | substitutes_5 |
|---------------|---------------|---------------|---------------|---------------|
| 30981733      |               |               |               |               |
| 598319049     | 38453298      | 2007569845    |               |               |
| 10309216      |               |               |               |               |
| 730446111     | 617024811     | 665689309     | 883699488     | 159896736     |
| 10290923      | 33282357      |               |               |               |
| 102649381     | 10294853      | 10294854      | 44749181      | 35132896      |
| 10307642      | 10307636      | 15754215      | 45612359      | 10307635      |
| 43982130      | 15556050      | 15556051      | 11961012      | 16777263      |
| 849607426     | 185158834     | 11028011      | 10309801      | 11028010      |
| 21905160      | 21609422      | 21609417      | 20554612      | 20554601      |

How can i unpivot and explode the array?


Solution

  • Create a temporary id column for uniquely identifying each row. Explode the substitutes then extract the first item as col and second item as val. Then group the dataframe by id, pivot on col and aggregate val with first

    (
        df
        .withColumn('id', F.monotonically_increasing_id())
        .selectExpr('id', "explode(substitutes) AS S")
        .selectExpr('id', "'substitutes_' || S[0] AS col", "S[1] as val")
        .groupby('id').pivot('col').agg(F.first('val'))
        .drop('id')
    )
    

    Result

    +-------------+-------------+-------------+-------------+-------------+
    |substitutes_1|substitutes_2|substitutes_3|substitutes_4|substitutes_5|
    +-------------+-------------+-------------+-------------+-------------+
    |    598319049|     38453298|   2007569845|         null|         null|
    |     30981733|         null|         null|         null|         null|
    |    102649381|     10294853|     10294854|     44749181|     35132896|
    |    849607426|    185158834|     11028011|     10309801|     11028010|
    |     10307642|     10307636|     15754215|     45612359|     10307635|
    |    730446111|    617024811|    665689309|    883699488|    159896736|
    |     10290923|     33282357|         null|         null|         null|
    |     43982130|     15556050|     15556051|     11961012|     16777263|
    |     10309216|         null|         null|         null|         null|
    |     21905160|     21609422|     21609417|     20554612|     20554601|
    +-------------+-------------+-------------+-------------+-------------+