Let's say I have the following dataframe:
age_group occupation sex country debt
31_40 attorney M USA 10000
41_50 doctor F Mexico 2000
21_30 dentist F Canada 7000
51_60 engineer M Hungary 9000
61_70 driver M Egypt 23000
Considering it contains millions of rows, how can I get the debt mean for each specific group in only one dataframe, so it would return something like that:
group value debt_mean
country egypt 12500
country usa 25000
age_group 21_30 5000
sex f 15000
sex m 15000
ocuppation driver 5200
Use list comprehension to create of list of means for each of the columns. Then use reduce to union the df in the list of means.
Long hand;
from functools import reduce
meanDF_list= [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group','mean') for x in df.drop('debt')]
reduce(lambda a, b: b.unionByName(a),meanDF_list).show()
Chained
reduce(lambda a, b: b.unionByName(a), [df.groupby(x).agg(mean('debt').alias('mean')).toDF('Group','mean') for x in df.drop('debt')]).show()
+--------+-------+
| Group| mean|
+--------+-------+
| USA|10000.0|
| Mexico| 2000.0|
| Canada| 7000.0|
| Hungary| 9000.0|
| Egypt|23000.0|
| M|14000.0|
| F| 4500.0|
|attorney|10000.0|
| doctor| 2000.0|
| dentist| 7000.0|
|engineer| 9000.0|
| driver|23000.0|
| 31_40|10000.0|
| 41_50| 2000.0|
| 21_30| 7000.0|
| 51_60| 9000.0|
| 61_70|23000.0|
+--------+-------+