
How to get the mean for multiple groups at once in PySpark


Let's say I have the following dataframe:

age_group    occupation   sex   country    debt
31_40          attorney    M    USA        10000
41_50          doctor      F    Mexico     2000
21_30          dentist     F    Canada     7000
51_60          engineer    M    Hungary    9000
61_70          driver      M    Egypt      23000
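
A minimal sketch that builds this dataframe for reproduction:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [('31_40', 'attorney', 'M', 'USA', 10000),
         ('41_50', 'doctor', 'F', 'Mexico', 2000),
         ('21_30', 'dentist', 'F', 'Canada', 7000),
         ('51_60', 'engineer', 'M', 'Hungary', 9000),
         ('61_70', 'driver', 'M', 'Egypt', 23000)],
        ['age_group', 'occupation', 'sex', 'country', 'debt'])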

Considering that it contains millions of rows, how can I get the mean debt for each group of every column in a single dataframe, so that it returns something like this:

group       value     debt_mean
country     egypt     12500
country     usa       25000
age_group   21_30     5000
sex         f         15000
sex         m         15000
occupation  driver    5200

Solution

  • Use a list comprehension to build one mean DataFrame per grouping column, renaming each result to a common Group/mean schema so the frames line up, then use reduce to union the DataFrames in the list.

    Long hand:

    from functools import reduce
    from pyspark.sql.functions import mean

    # build one aggregated DataFrame per grouping column (every column except 'debt')
    meanDF_list = [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean') for x in df.drop('debt').columns]

    # union the per-column results into a single DataFrame
    reduce(lambda a, b: b.unionByName(a), meanDF_list).show()
    

    Chained

    reduce(lambda a, b: b.unionByName(a), [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean') for x in df.drop('debt').columns]).show()
    
    +--------+-------+
    |   Group|   mean|
    +--------+-------+
    |     USA|10000.0|
    |  Mexico| 2000.0|
    |  Canada| 7000.0|
    | Hungary| 9000.0|
    |   Egypt|23000.0|
    |       M|14000.0|
    |       F| 4500.0|
    |attorney|10000.0|
    |  doctor| 2000.0|
    | dentist| 7000.0|
    |engineer| 9000.0|
    |  driver|23000.0|
    |   31_40|10000.0|
    |   41_50| 2000.0|
    |   21_30| 7000.0|
    |   51_60| 9000.0|
    |   61_70|23000.0|
    +--------+-------+
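
    This output lacks the group / value / debt_mean columns from the layout asked for in the question. One possible variant (a sketch, tagging each per-column result with lit before the union) to get that exact shape:

    from functools import reduce
    from pyspark.sql.functions import mean, lit, col

    # alias the grouping column to 'value' and record which column it came from
    mean_by_col = [df.groupby(col(c).alias('value'))
                     .agg(mean('debt').alias('debt_mean'))
                     .withColumn('group', lit(c))
                     .select('group', 'value', 'debt_mean')
                   for c in df.drop('debt').columns]

    reduce(lambda a, b: a.unionByName(b), mean_by_col).show()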