Tags: python, sql, apache-spark, pyspark, apache-spark-sql

How to count a boolean in grouped Spark data frame


I want to count how many records are true in a column of a grouped Spark DataFrame, but I don't know how to do that in Python. For example, I have data with Region, Salary and IsUnemployed columns, where IsUnemployed is a Boolean. I want to see how many unemployed people there are in each region. I know I could filter and then group by, but I want to produce both aggregations at the same time, as below:

from pyspark.sql import functions as F  
data.groupby("Region").agg(F.avg("Salary"), F.count("IsUnemployed")) 

Solution

  • Probably the simplest solution is a plain CAST (C-style, where TRUE -> 1 and FALSE -> 0) combined with SUM:

    (data
        .groupby("Region")
        .agg(F.avg("Salary"), F.sum(F.col("IsUnemployed").cast("long"))))
    
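    To see why this works, here is a plain-Python sketch of the same semantics, using a few hypothetical sample rows (Region, Salary, IsUnemployed) that mirror the question's schema rather than any real dataset:

    ```python
    from collections import defaultdict

    # Hypothetical sample rows: (Region, Salary, IsUnemployed)
    rows = [
        ("East", 50000, True),
        ("East", 60000, False),
        ("West", 40000, True),
        ("West", 45000, True),
    ]

    # CAST(IsUnemployed AS LONG) turns True into 1 and False into 0,
    # so SUM over the cast column counts the True rows in each group.
    unemployed = defaultdict(int)
    for region, _salary, is_unemployed in rows:
        unemployed[region] += int(is_unemployed)  # the "cast"

    print(dict(unemployed))  # {'East': 1, 'West': 2}
    ```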

    A slightly more universal and idiomatic solution is CASE WHEN with COUNT (COUNT ignores NULLs, and F.when without an otherwise clause yields NULL for the non-matching rows):

    (data
        .groupby("Region")
        .agg(
            F.avg("Salary"),
            F.count(F.when(F.col("IsUnemployed"), F.col("IsUnemployed")))))
    

    but here it is clearly overkill.
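    The COUNT-over-CASE-WHEN trick can also be sketched in plain Python, again with hypothetical sample rows. The key point is that COUNT only counts non-NULL values, and F.when(cond, value) without .otherwise() produces NULL when the condition is false:

    ```python
    # Hypothetical sample rows: (Region, Salary, IsUnemployed)
    rows = [
        ("East", 50000, True),
        ("East", 60000, False),
        ("West", 40000, True),
        ("West", 45000, True),
    ]

    def case_when(flag):
        # Mirrors F.when(F.col("IsUnemployed"), F.col("IsUnemployed")):
        # returns the value when the condition holds, otherwise NULL (None).
        return flag if flag else None

    counts = {}
    for region, _salary, is_unemployed in rows:
        if case_when(is_unemployed) is not None:  # COUNT skips NULLs
            counts[region] = counts.get(region, 0) + 1

    print(counts)  # {'East': 1, 'West': 2}
    ```

    Either way, the result is the same per-region count of True values; the CASE WHEN form just generalizes more naturally to conditions other than a bare Boolean column.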