
How to get the mean for multiple groups at once in PySpark


Let's say I have the following dataframe:

age_group    occupation   sex   country    debt
31_40          attorney    M    USA        10000
41_50          doctor      F    Mexico     2000
21_30          dentist     F    Canada     7000
51_60          engineer    M    Hungary    9000
61_70          driver      M    Egypt      23000
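
A minimal sketch that builds this dataframe for reproduction:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [('31_40', 'attorney', 'M', 'USA', 10000),
         ('41_50', 'doctor', 'F', 'Mexico', 2000),
         ('21_30', 'dentist', 'F', 'Canada', 7000),
         ('51_60', 'engineer', 'M', 'Hungary', 9000),
         ('61_70', 'driver', 'M', 'Egypt', 23000)],
        ['age_group', 'occupation', 'sex', 'country', 'debt'])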

Considering that it contains millions of rows, how can I get the mean debt for each group of every column in a single dataframe, so that it returns something like this:

group       value     debt_mean
country     egypt     12500
country     usa       25000
age_group   21_30     5000
sex         f         15000
sex         m         15000
occupation  driver    5200

Solution

  • Use a list comprehension to build one mean DataFrame per grouping column, renaming each result to a common Group/mean schema so the frames line up, then use reduce to union the DataFrames in the list.

    Long hand:

    from functools import reduce
    from pyspark.sql.functions import mean

    # build one aggregated DataFrame per grouping column (every column except 'debt')
    meanDF_list = [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean') for x in df.drop('debt').columns]

    # union the per-column results into a single DataFrame
    reduce(lambda a, b: b.unionByName(a), meanDF_list).show()
    

    Chained

    reduce(lambda a, b: b.unionByName(a), [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group', 'mean') for x in df.drop('debt').columns]).show()
    
    +--------+-------+
    |   Group|   mean|
    +--------+-------+
    |     USA|10000.0|
    |  Mexico| 2000.0|
    |  Canada| 7000.0|
    | Hungary| 9000.0|
    |   Egypt|23000.0|
    |       M|14000.0|
    |       F| 4500.0|
    |attorney|10000.0|
    |  doctor| 2000.0|
    | dentist| 7000.0|
    |engineer| 9000.0|
    |  driver|23000.0|
    |   31_40|10000.0|
    |   41_50| 2000.0|
    |   21_30| 7000.0|
    |   51_60| 9000.0|
    |   61_70|23000.0|
    +--------+-------+
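
    This output lacks the group / value / debt_mean columns from the layout asked for in the question. One possible variant (a sketch, tagging each per-column result with lit before the union) to get that exact shape:

    from functools import reduce
    from pyspark.sql.functions import mean, lit, col

    # alias the grouping column to 'value' and record which column it came from
    mean_by_col = [df.groupby(col(c).alias('value'))
                     .agg(mean('debt').alias('debt_mean'))
                     .withColumn('group', lit(c))
                     .select('group', 'value', 'debt_mean')
                   for c in df.drop('debt').columns]

    reduce(lambda a, b: a.unionByName(b), mean_by_col).show()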