I have a big table like the one below:
id | status |
---|---|
1 | In Progress |
1 | In Progress |
2 | In Progress |
2 | Start |
1 | In Progress |
1 | Start |
3 | In Progress |
3 | Start |
1 | In Progress |
1 | Start |
1 | End |
1 | In Progress |
1 | End |
2 | In Progress |
3 | In Progress |
3 | End |
Using PySpark functions, I want to derive per-id counts of each status in separate columns, like below:
id | start-count | in-progress-count | end-count |
---|---|---|---|
1 | 2 | 5 | 2 |
2 | 1 | 2 | 0 |
3 | 1 | 2 | 1 |
Can I make partitionBy work in this use case? If not, what would be the best approach?
You can use groupBy with conditional counts; sample code below. (A window-based version using partitionBy, plus a pivot alternative, are sketched after the output.)
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, when

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "In Progress"),
    (1, "In Progress"),
    (2, "In Progress"),
    (2, "Start"),
    (1, "In Progress"),
    (1, "Start"),
    (3, "In Progress"),
    (3, "Start"),
    (1, "In Progress"),
    (1, "Start"),
    (1, "End"),
    (1, "In Progress"),
    (1, "End"),
    (2, "In Progress"),
    (3, "In Progress"),
    (3, "End"),
]
df = spark.createDataFrame(data, ["id", "status"])

# when() without otherwise() returns null for non-matching rows, and
# count() ignores nulls, so each aggregate tallies only its own status.
output_df = df.groupBy("id").agg(
    count(when(df.status == "Start", True)).alias("start-count"),
    count(when(df.status == "In Progress", True)).alias("in-progress-count"),
    count(when(df.status == "End", True)).alias("end-count"),
)
output_df.show()
Output:
+---+-----------+-----------------+---------+
| id|start-count|in-progress-count|end-count|
+---+-----------+-----------------+---------+
| 1| 2| 5| 2|
| 2| 1| 2| 0|
| 3| 1| 2| 1|
+---+-----------+-----------------+---------+
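To answer the partitionBy part of the question: yes, a window can produce the same numbers. Here is a minimal sketch (variable names are mine), running each count over Window.partitionBy("id") and then calling distinct() to collapse the duplicated per-row totals. groupBy is usually the cleaner choice, since it avoids that extra deduplication step:

from pyspark.sql import Window
from pyspark.sql.functions import col, count, when

# Each count() runs over all rows sharing the same id; every row in the
# partition carries the same totals, so distinct() collapses the duplicates.
w = Window.partitionBy("id")
windowed = (
    df.withColumn("start-count", count(when(col("status") == "Start", True)).over(w))
    .withColumn("in-progress-count", count(when(col("status") == "In Progress", True)).over(w))
    .withColumn("end-count", count(when(col("status") == "End", True)).over(w))
    .select("id", "start-count", "in-progress-count", "end-count")
    .distinct()
)
windowed.show()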
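Another option, if you'd rather not enumerate the conditions by hand, is pivot. This sketch assumes the three status values are known up front (passing them explicitly saves Spark an extra scan to discover them); the resulting columns are named after the status values themselves, so you would rename them if you need the hyphenated headers:

# pivot() creates one count column per status value; ids with no "End"
# rows get null there, so fill with 0 to match the expected output.
pivoted = (
    df.groupBy("id")
    .pivot("status", ["Start", "In Progress", "End"])
    .count()
    .na.fill(0)
)
pivoted.show()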