
Is there a PySpark function that will merge data from a column for rows with the same id?


I have the following dataframe:

+---+---+
| A | B |
+---+---+
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | f |
| 2 | g |
| 3 | j |
+---+---+

I need it in a DataFrame/RDD format like:

(1, [a, b, c])
(2, [f, g])
(3, [j])

I'm new to Spark and was wondering whether this operation can be performed by a single function.

I tried using flatMap, but I don't think I'm using it correctly.


Solution

  • You can group by "A" and then use an aggregation function, for example collect_set or collect_list:

    import pyspark.sql.functions as F
    
    # Sample data mirroring the question's dataframe
    data = [
        {"A": 1, "B": "a"},
        {"A": 1, "B": "b"},
        {"A": 1, "B": "c"},
        {"A": 2, "B": "f"},
        {"A": 2, "B": "g"},
        {"A": 3, "B": "j"}
    ]

    df = spark.createDataFrame(data)
    df.groupBy("A").agg(F.collect_set(F.col("B"))).show()
    

    Output

    +---+--------------+
    |  A|collect_set(B)|
    +---+--------------+
    |  1|     [c, b, a]|
    |  2|        [g, f]|
    |  3|           [j]|
    +---+--------------+
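    Two related points, since the question also asks about an RDD format: collect_set deduplicates values and does not guarantee any order, so if duplicates or order matter, collect_list is usually the better choice; and the same grouping can be done at the RDD level with groupByKey. A sketch of both, assuming a running SparkSession bound to the name spark:

    ```python
    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    data = [(1, "a"), (1, "b"), (1, "c"), (2, "f"), (2, "g"), (3, "j")]
    df = spark.createDataFrame(data, ["A", "B"])

    # collect_list keeps duplicates; collect_set would drop them
    df.groupBy("A").agg(F.collect_list("B").alias("Bs")).show()

    # RDD route: map to (key, value) pairs, group by key,
    # then materialize each group's values as a plain list
    pairs = df.rdd.map(lambda row: (row["A"], row["B"]))
    result = pairs.groupByKey().mapValues(list)
    print(sorted(result.collect()))
    ```

    Note that groupByKey shuffles all values for a key to one executor, so on large data the DataFrame aggregation is generally preferable.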