python, apache-spark, pyspark, apache-spark-sql, rdd

Use groupby or aggregate to merge items in each transaction in RDD or DataFrame to do FP-growth


I want to transform a DataFrame with this structure:

+---+-----+-----+
| id|order|items|
+---+-----+-----+
|  0|    a|    1|
|  1|    a|    2|
|  2|    a|    5|
|  3|    b|    1|
|  4|    b|    2|
|  5|    b|    3|
|  6|    b|    5|
|  7|    c|    1|
|  8|    c|    2|
+---+-----+-----+

into this:

+---+-----+------------+
| id|order|       items|
+---+-----+------------+
|  0|    a|   [1, 2, 5]|
|  1|    b|[1, 2, 3, 5]|
|  2|    c|      [1, 2]|
+---+-----+------------+

How can I do it in PySpark?
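
For reproducibility, the input DataFrame above can be built with something like this (a minimal sketch; the SparkSession setup is an assumption):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    data = [(0, "a", 1), (1, "a", 2), (2, "a", 5),
            (3, "b", 1), (4, "b", 2), (5, "b", 3), (6, "b", 5),
            (7, "c", 1), (8, "c", 2)]
    df = spark.createDataFrame(data, ["id", "order", "items"])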


Solution

  • Grouping by order with the collect_list function and generating a unique id with row_number should work in your case:

    from pyspark.sql import functions as F, Window

    (df.groupBy("order")
       .agg(F.collect_list("items").alias("items"))
       # row_number() starts at 1, so subtract 1 to get the 0-based id shown above
       .withColumn("id", F.row_number().over(Window.orderBy("order")) - 1))
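
    Since the end goal is FP-growth, the aggregated column can be fed directly into pyspark.ml.fpm.FPGrowth. A minimal sketch, assuming items must be de-duplicated per transaction (FPGrowth rejects transactions that contain duplicate items, so collect_set is the safer aggregation here; the minSupport and minConfidence values are placeholders):

        from pyspark.ml.fpm import FPGrowth
        from pyspark.sql import functions as F

        # FPGrowth requires unique items within each transaction
        transactions = df.groupBy("order").agg(F.collect_set("items").alias("items"))

        fp = FPGrowth(itemsCol="items", minSupport=0.3, minConfidence=0.6)
        model = fp.fit(transactions)
        model.freqItemsets.show()
        model.associationRules.show()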
    

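    If you would rather stay in the RDD API (the question title mentions RDDs), the same merge can be sketched with groupByKey; the sorted call is only there to make the item order deterministic:

        pairs = df.rdd.map(lambda row: (row["order"], row["items"]))
        merged = pairs.groupByKey().mapValues(sorted)
        # e.g. merged.collect() -> [('a', [1, 2, 5]), ('b', [1, 2, 3, 5]), ('c', [1, 2])]
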
    Hope this helps!