I want to change the dataframe with this structure to the second one.
+---+-----+-----+
| id|order|items|
+---+-----+-----+
| 0| a| 1|
| 1| a| 2|
| 2| a| 5|
| 3| b| 1|
| 4| b| 2|
| 5| b| 3|
| 6| b| 5|
| 7| c| 1|
| 8| c| 2|
+---+-----+-----+
change it to this:
+---+-----+------------+
| id|order| items|
+---+-----+------------+
| 0| a| [1, 2, 5]|
| 1| b|[1, 2, 3, 5]|
| 2| c| [1, 2]|
+---+-----+------------+
How can I do it in PySpark?
Groupby
order with collect_list
function and a unique id with row_number
should work in your case
from pyspark.sql import functions as F
df.groupBy("order").agg(F.collect_list("items"))
.withColumn("id", F.row_number().over(Window.orderBy("order")))
Hope this helps!