
Is there a PySpark function that will merge data from a column for rows with the same id?


I have the following dataframe:

+---+---+
| A | B |
+---+---+
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | f |
| 2 | g |
| 3 | j |
+---+---+

I need it in a DataFrame/RDD format like:

(1, [a, b, c])
(2, [f, g])
(3, [j])

I'm new to Spark and was wondering whether this operation can be performed by a single function.

I tried using flatMap, but I don't think I'm using it correctly.


Solution

  • You can group by "A" and then use an aggregation function, for example collect_set or collect_list:

    import pyspark.sql.functions as F
    
    # Sample data mirroring the question's dataframe
    data = [
        {"A": 1, "B": "a"},
        {"A": 1, "B": "b"},
        {"A": 1, "B": "c"},
        {"A": 2, "B": "f"},
        {"A": 2, "B": "g"},
        {"A": 3, "B": "j"}
    ]

    df = spark.createDataFrame(data)
    df.groupBy("A").agg(F.collect_set(F.col("B"))).show()
    

    Output

    +---+--------------+
    |  A|collect_set(B)|
    +---+--------------+
    |  1|     [c, b, a]|
    |  2|        [g, f]|
    |  3|           [j]|
    +---+--------------+
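    Two related points, since the question also asks about an RDD format: collect_set deduplicates values and does not guarantee any order, so if duplicates or order matter, collect_list is usually the better choice; and the same grouping can be done at the RDD level with groupByKey. A sketch of both, assuming a running SparkSession bound to the name spark:

    ```python
    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    data = [(1, "a"), (1, "b"), (1, "c"), (2, "f"), (2, "g"), (3, "j")]
    df = spark.createDataFrame(data, ["A", "B"])

    # collect_list keeps duplicates; collect_set would drop them
    df.groupBy("A").agg(F.collect_list("B").alias("Bs")).show()

    # RDD route: map to (key, value) pairs, group by key,
    # then materialize each group's values as a plain list
    pairs = df.rdd.map(lambda row: (row["A"], row["B"]))
    result = pairs.groupByKey().mapValues(list)
    print(sorted(result.collect()))
    ```

    Note that groupByKey shuffles all values for a key to one executor, so on large data the DataFrame aggregation is generally preferable.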