python, apache-spark, pyspark, apache-spark-sql, rdd

Use groupby or aggregate to merge items in each transaction in RDD or DataFrame to do FP-growth


I want to transform a DataFrame with this structure:

+---+-----+-----+
| id|order|items|
+---+-----+-----+
|  0|    a|    1|
|  1|    a|    2|
|  2|    a|    5|
|  3|    b|    1|
|  4|    b|    2|
|  5|    b|    3|
|  6|    b|    5|
|  7|    c|    1|
|  8|    c|    2|
+---+-----+-----+

into this:

+---+-----+------------+
| id|order|       items|
+---+-----+------------+
|  0|    a|   [1, 2, 5]|
|  1|    b|[1, 2, 3, 5]|
|  2|    c|      [1, 2]|
+---+-----+------------+

How can I do it in PySpark?
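
For reproducibility, the input DataFrame above can be built with something like this (a minimal sketch; the SparkSession setup is an assumption):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data = [(0, "a", 1), (1, "a", 2), (2, "a", 5),
            (3, "b", 1), (4, "b", 2), (5, "b", 3), (6, "b", 5),
            (7, "c", 1), (8, "c", 2)]
    df = spark.createDataFrame(data, ["id", "order", "items"])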


Solution

  • Grouping by order with the collect_list function and generating a unique id with row_number should work in your case:

    from pyspark.sql import functions as F, Window

    (df.groupBy("order")
       .agg(F.collect_list("items").alias("items"))
       # row_number() starts at 1, so subtract 1 to get the 0-based id shown above
       .withColumn("id", F.row_number().over(Window.orderBy("order")) - 1))
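
    Since the end goal is FP-growth, the aggregated column can be fed directly into pyspark.ml.fpm.FPGrowth. A minimal sketch, assuming items must be de-duplicated per transaction (FPGrowth rejects transactions that contain duplicate items, so collect_set is the safer aggregation here; the minSupport and minConfidence values are placeholders):

        from pyspark.ml.fpm import FPGrowth
        from pyspark.sql import functions as F

        # FPGrowth requires unique items within each transaction
        transactions = df.groupBy("order").agg(F.collect_set("items").alias("items"))

        fp = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.6)
        model = fp.fit(transactions)
        model.freqItemsets.show()
        model.associationRules.show()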
    

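    If you would rather stay in the RDD API (the question title mentions RDDs), the same merge can be sketched with groupByKey; the sorted call is only there to make the item order deterministic:

        pairs = df.rdd.map(lambda row: (row["order"], row["items"]))
        merged = pairs.groupByKey().mapValues(sorted)
        # e.g. merged.collect() -> [('a', [1, 2, 5]), ('b', [1, 2, 3, 5]), ('c', [1, 2])]
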
    Hope this helps!