Tags: python, sql, pyspark, group-by, window-functions

PySpark: Group by and aggregation of matching values


I have a big table like the one below:

id status
1 In Progress
1 In Progress
2 In Progress
2 Start
1 In Progress
1 Start
3 In Progress
3 Start
1 In Progress
1 Start
1 End
1 In Progress
1 End
2 In Progress
3 In Progress
3 End

Using PySpark functions, I want to derive per-id counts of each status in separate columns, like below:

id start-count in-progress-count end-count
1 2 5 2
2 1 2 0
3 1 2 1

Can I make partitionBy work in this use case? If not, what would be the best approach?


Solution

  • You can use groupBy and count with conditions; sample code below:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import count, when

    # In spark-shell/notebooks `spark` already exists; this creates one otherwise.
    spark = SparkSession.builder.getOrCreate()
    
    data = [
        (1, "In Progress"),
        (1, "In Progress"),
        (2, "In Progress"),
        (2, "Start"),
        (1, "In Progress"),
        (1, "Start"),
        (3, "In Progress"),
        (3, "Start"),
        (1, "In Progress"),
        (1, "Start"),
        (1, "End"),
        (1, "In Progress"),
        (1, "End"),
        (2, "In Progress"),
        (3, "In Progress"),
        (3, "End"),
    ]
    
    df = spark.createDataFrame(data, ["id", "status"])
    
    # count() ignores nulls, and when() without otherwise() returns null for
    # non-matching rows, so each count only tallies rows with that status.
    output_df = df.groupBy("id").agg(
        count(when(df.status == "Start", True)).alias("start-count"),
        count(when(df.status == "In Progress", True)).alias("in-progress-count"),
        count(when(df.status == "End", True)).alias("end-count"),
    )
    
    output_df.show()
    

    Output:

    +---+-----------+-----------------+---------+
    | id|start-count|in-progress-count|end-count|
    +---+-----------+-----------------+---------+
    |  1|          2|                5|        2|
    |  2|          1|                2|        0|
    |  3|          1|                2|        1|
    +---+-----------+-----------------+---------+
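
  • As for the partitionBy question: yes, the same conditional counts can be computed with a window partitioned by id, but the counts are then repeated on every input row and you have to deduplicate afterwards. A minimal sketch, assuming the same df as above (the column names and the dropDuplicates step are just one way to collapse the result):

    from pyspark.sql import Window
    from pyspark.sql.functions import count, when

    w = Window.partitionBy("id")

    # Each count is evaluated over all rows sharing the same id; the repeated
    # per-row results are then collapsed to a single row per id.
    windowed_df = (
        df.withColumn("start-count", count(when(df.status == "Start", True)).over(w))
          .withColumn("in-progress-count", count(when(df.status == "In Progress", True)).over(w))
          .withColumn("end-count", count(when(df.status == "End", True)).over(w))
          .select("id", "start-count", "in-progress-count", "end-count")
          .dropDuplicates(["id"])
    )

    windowed_df.show()

    For this use case, groupBy is the simpler and usually cheaper option; df.groupBy("id").pivot("status").count() is another alternative if you don't need the exact column names shown above.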