Search code examples
pythonpandasgroup-byweighted-average

Group by weighted mean, allowing for zero value weights


I want to take the weighted mean of a column in a group-by statement, like this

import pandas as pd
import numpy as np

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
                   'weight': [2, 2, 4, 3, 1, 2]})

df_grouped = df.groupby('group')[['value', 'weight']].apply(lambda x: sum(x['value']*x['weight'])/sum(x['weight']))

df_grouped
Out[17]: 
group
A    0.275000
B    0.316667
dtype: float64

So far all is well. However, in some cases the weights sum to zero, for instance

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
                   'weight': [1, 2, 3, 0, 0, 0]})

In this case I want to take the simple mean. The above expression obviously fail because of a divide by zero.

The method I currently use is to replace the weights with one wherever the weights sum to one

df_temp = df.groupby('group')['weight'].transform('sum').reset_index()
df['new_weight'] = np.where(df_temp['weight']==0, 1, df['weight'])

df_grouped = df.groupby('group')[['value', 'new_weight']].apply(lambda x: sum(x['value']*x['new_weight'])/sum(x['new_weight']))

This is an ok solution. But can this be achieved by a one-liner? Some special function for instance?


Solution

  • If you need it to be done in a one-liner it is possible to check whether the Group By Sum is equivalent to zero using a ternary operator inside the lambda as follows. If the group by sum is zero then use the regular mean.

    df.groupby('group')[['value', 'weight']].apply(lambda x:sum(x['value'])/len(x['weight'])  if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
    
        group
        A    0.266667
        B    0.300000
        dtype: float64
    

    The above snippet's regular mean calculation can be further minified as follows.

    df.groupby('group')[['value', 'weight']].apply(lambda x:x['value'].mean() if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
    

    However, I think this type of one liners reduce the readability of the code.