Search code examples
pythonpandasdataframepivot-tableweighted-average

pandas pivot table: calculate weighted averages through aggfunc


I've got a pandas dataframe on education and income that looks basically like this.

import pandas as pd
import numpy as np

data = {
    'education': ['Low', 'High', 'High', 'Medium', 'Low', 'Low', 'High', 'Low', 'Medium', 'Medium'],
    'income': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'weights': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
}
    
df = pd.DataFrame(data, columns=['education', 'income', 'weights'])

print(df)
  education  income  weights
0       Low       1       11
1      High       2       12
2      High       3       13
3    Medium       4       14
4       Low       5       15
5       Low       6       16
6      High       7       17
7       Low       8       18
8    Medium       9       19
9    Medium      10       20

I've created a pivot table that computes the mean income for each category of education, like so:

pivot_educ_inc = pd.pivot_table(df, 
                                values='income',
                                index='education',
                                aggfunc=np.mean)

print(pivot_educ_inc)
             income
education          
High       4.000000
Low        5.000000
Medium     7.666667

What I'd really like is to use my weights column to get the weighted means of income for each level of education. But I can't find a way to define a weighted means function that I can assign to aggfunc and that can do this.

It isn't really convenient (possible?) for me to simply create a weighted dataset because the weights add up to over 100 million. Also, ideally I'd like to use the aggfunc argument because I've got many more columns like education in my dataset that I'd like to compute weighted averages for, some of which have upwards of 25 categories.

I might be totally overlooking something here, but I'm stumped.


Solution

  • I'm a big fan of pivot_table, so here it goes a solution using it:

    pivot = df.pivot_table(values='income',
                           index='education',
                           aggfunc=lambda rows: np.average(rows, weights=df.loc[rows.index, 'weights']))
    

    The resulting dataframe will be as follows:

                 income
    education          
    High       4.333333
    Low        5.433333
    Medium     8.056604