I have the following pyspark dataframe.
+--------------------------------------------------------------------------------+
|substitutes |
+--------------------------------------------------------------------------------+
|[[1, 30981733]] |
|[[1, 598319049], [2, 38453298], [3, 2007569845]] |
|[[1, 10309216]] |
|[[1, 730446111], [2, 617024811], [3, 665689309], [4, 883699488], [5, 159896736]]|
|[[1, 10290923], [2, 33282357]] |
|[[1, 102649381], [2, 10294853], [3, 10294854], [4, 44749181], [5, 35132896]] |
|[[1, 10307642], [2, 10307636], [3, 15754215], [4, 45612359], [5, 10307635]] |
|[[1, 43982130], [2, 15556050], [3, 15556051], [4, 11961012], [5, 16777263]] |
|[[1, 849607426], [2, 185158834], [3, 11028011], [4, 10309801], [5, 11028010]] |
|[[1, 21905160], [2, 21609422], [3, 21609417], [4, 20554612], [5, 20554601]] |
+--------------------------------------------------------------------------------+
And I would like to explode the columns into multiple columns
| substitutes_1 | substitutes_2 | substitutes_3 | substitutes_4 | substitutes_5 |
|---------------|---------------|---------------|---------------|---------------|
| 30981733 | | | | |
| 598319049 | 38453298 | 2007569845 | | |
| 10309216 | | | | |
| 730446111 | 617024811 | 665689309 | 883699488 | 159896736 |
| 10290923 | 33282357 | | | |
| 102649381 | 10294853 | 10294854 | 44749181 | 35132896 |
| 10307642 | 10307636 | 15754215 | 45612359 | 10307635 |
| 43982130 | 15556050 | 15556051 | 11961012 | 16777263 |
| 849607426 | 185158834 | 11028011 | 10309801 | 11028010 |
| 21905160 | 21609422 | 21609417 | 20554612 | 20554601 |
How can i unpivot and explode the array?
Create a temporary id
column for uniquely identifying each row. Explode the substitutes
then extract the first item as col
and second item as val
. Then group the dataframe by id
, pivot
on col
and aggregate val
with first
(
df
.withColumn('id', F.monotonically_increasing_id())
.selectExpr('id', "explode(substitutes) AS S")
.selectExpr('id', "'substitutes_' || S[0] AS col", "S[1] as val")
.groupby('id').pivot('col').agg(F.first('val'))
.drop('id')
)
Result
+-------------+-------------+-------------+-------------+-------------+
|substitutes_1|substitutes_2|substitutes_3|substitutes_4|substitutes_5|
+-------------+-------------+-------------+-------------+-------------+
| 598319049| 38453298| 2007569845| null| null|
| 30981733| null| null| null| null|
| 102649381| 10294853| 10294854| 44749181| 35132896|
| 849607426| 185158834| 11028011| 10309801| 11028010|
| 10307642| 10307636| 15754215| 45612359| 10307635|
| 730446111| 617024811| 665689309| 883699488| 159896736|
| 10290923| 33282357| null| null| null|
| 43982130| 15556050| 15556051| 11961012| 16777263|
| 10309216| null| null| null| null|
| 21905160| 21609422| 21609417| 20554612| 20554601|
+-------------+-------------+-------------+-------------+-------------+