How can I efficiently explode a PySpark dataframe in this way:
+----+-------+--------+------+
| id | sport | travel | work |
+----+-------+--------+------+
|  1 |   0.2 |    0.4 |  0.6 |
|  2 |   0.7 |    0.9 |  0.5 |
+----+-------+--------+------+
and my desired output is this:
+------+-------+
| c_id | score |
+------+-------+
|    1 |   0.2 |
|    1 |   0.4 |
|    1 |   0.6 |
|    2 |   0.7 |
|    2 |   0.9 |
|    2 |   0.5 |
+------+-------+
First you could put your 3 columns in an array, then arrays_zip them, then explode and unpack them with .*, and finally select and rename the unzipped column.
import pyspark.sql.functions as F

df.withColumn("zip", F.explode(F.arrays_zip(F.array("sport", "travel", "work"))))\
  .select("id", F.col("zip.*"))\
  .withColumnRenamed("0", "score")\
  .show()
+---+-----+
| id|score|
+---+-----+
| 1| 0.2|
| 1| 0.4|
| 1| 0.6|
| 2| 0.7|
| 2| 0.9|
| 2| 0.5|
+---+-----+
You can also do this without arrays_zip (as mentioned by cPak). arrays_zip is for combining arrays from different dataframe columns into struct form, so that you can explode all of them together and then select with .*. For this case you could just use:
import pyspark.sql.functions as F

df.withColumn("score", F.explode(F.array(*[c for c in df.columns if c != "id"])))\
  .select("id", "score")\
  .show()