Tags: python, sql, pyspark, group-by, window-functions

PySpark: Group by and aggregation of matching values


I have a big table like the one below:

id status
1 In Progress
1 In Progress
2 In Progress
2 Start
1 In Progress
1 Start
3 In Progress
3 Start
1 In Progress
1 Start
1 End
1 In Progress
1 End
2 In Progress
3 In Progress
3 End

Using PySpark functions, I want to derive per-id counts of each status in separate columns, like below:

id start-count in-progress-count end-count
1 2 5 2
2 1 2 0
3 1 2 1

Can I make partitionBy work in this use case? If not, what would be the best approach?


Solution

  • You can use groupBy and count with conditions; sample code below:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import count, when

    # In spark-shell/notebooks `spark` already exists; this creates one otherwise.
    spark = SparkSession.builder.getOrCreate()
    
    data = [
        (1, "In Progress"),
        (1, "In Progress"),
        (2, "In Progress"),
        (2, "Start"),
        (1, "In Progress"),
        (1, "Start"),
        (3, "In Progress"),
        (3, "Start"),
        (1, "In Progress"),
        (1, "Start"),
        (1, "End"),
        (1, "In Progress"),
        (1, "End"),
        (2, "In Progress"),
        (3, "In Progress"),
        (3, "End"),
    ]
    
    df = spark.createDataFrame(data, ["id", "status"])
    
    # count() ignores nulls, and when() without otherwise() returns null for
    # non-matching rows, so each count only tallies rows with that status.
    output_df = df.groupBy("id").agg(
        count(when(df.status == "Start", True)).alias("start-count"),
        count(when(df.status == "In Progress", True)).alias("in-progress-count"),
        count(when(df.status == "End", True)).alias("end-count"),
    )
    
    output_df.show()
    

    Output:

    +---+-----------+-----------------+---------+
    | id|start-count|in-progress-count|end-count|
    +---+-----------+-----------------+---------+
    |  1|          2|                5|        2|
    |  2|          1|                2|        0|
    |  3|          1|                2|        1|
    +---+-----------+-----------------+---------+
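
  • As for the partitionBy question: yes, the same conditional counts can be computed with a window partitioned by id, but the counts are then repeated on every input row and you have to deduplicate afterwards. A minimal sketch, assuming the same df as above (the column names and the dropDuplicates step are just one way to collapse the result):

    from pyspark.sql import Window
    from pyspark.sql.functions import count, when

    w = Window.partitionBy("id")

    # Each count is evaluated over all rows sharing the same id; the repeated
    # per-row results are then collapsed to a single row per id.
    windowed_df = (
        df.withColumn("start-count", count(when(df.status == "Start", True)).over(w))
          .withColumn("in-progress-count", count(when(df.status == "In Progress", True)).over(w))
          .withColumn("end-count", count(when(df.status == "End", True)).over(w))
          .select("id", "start-count", "in-progress-count", "end-count")
          .dropDuplicates(["id"])
    )

    windowed_df.show()

    For this use case, groupBy is the simpler and usually cheaper option; df.groupBy("id").pivot("status").count() is another alternative if you don't need the exact column names shown above.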