I want to count how many records are true in a Boolean column of a grouped Spark DataFrame, but I don't know how to do that in Python. For example, I have data with Region, Salary and IsUnemployed columns, where IsUnemployed is a Boolean. I want to see how many unemployed people there are in each region. I know I can do a filter and then a groupby, but I want to compute two aggregations at the same time, as below:
from pyspark.sql import functions as F
data.groupby("Region").agg(F.avg("Salary"), F.count("IsUnemployed"))
Probably the simplest solution is a plain CAST (C-style, where TRUE -> 1 and FALSE -> 0) combined with SUM:
(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        # True casts to 1 and False to 0, so the sum equals the number of True rows
        F.sum(F.col("IsUnemployed").cast("long"))))
A slightly more universal and idiomatic solution is CASE WHEN with COUNT:
(data
    .groupby("Region")
    .agg(
        F.avg("Salary"),
        F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))
This works because F.when without an otherwise clause yields NULL for rows where the condition is false, and COUNT skips NULLs; here, though, it is clearly overkill.
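As a side note, if you are on PySpark 3.5 or later, there is also F.count_if, which counts the rows where a Boolean expression is true directly:

(data
    .groupby("Region")
    .agg(F.avg("Salary"), F.count_if(F.col("IsUnemployed"))))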