I want to take the weighted mean of a column in a group-by statement, like this
import pandas as pd
import numpy as np
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
'weight': [2, 2, 4, 3, 1, 2]})
df_grouped = df.groupby('group')[['value', 'weight']].apply(lambda x: sum(x['value']*x['weight'])/sum(x['weight']))
df_grouped
Out[17]:
group
A 0.275000
B 0.316667
dtype: float64
So far all is well. However, in some cases the weights sum to zero, for instance
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
'value': [0.4, 0.3, 0.2, 0.4, 0.3, 0.2],
'weight': [1, 2, 3, 0, 0, 0]})
In this case I want to take the simple mean. The above expression obviously fail because of a divide by zero.
The method I currently use is to replace the weights with one wherever the weights sum to one
df_temp = df.groupby('group')['weight'].transform('sum').reset_index()
df['new_weight'] = np.where(df_temp['weight']==0, 1, df['weight'])
df_grouped = df.groupby('group')[['value', 'new_weight']].apply(lambda x: sum(x['value']*x['new_weight'])/sum(x['new_weight']))
This is an ok solution. But can this be achieved by a one-liner? Some special function for instance?
If you need it to be done in a one-liner it is possible to check whether the Group By Sum is equivalent to zero using a ternary operator inside the lambda as follows. If the group by sum is zero then use the regular mean.
df.groupby('group')[['value', 'weight']].apply(lambda x:sum(x['value'])/len(x['weight']) if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
group
A 0.266667
B 0.300000
dtype: float64
The above snippet's regular mean calculation can be further minified as follows.
df.groupby('group')[['value', 'weight']].apply(lambda x:x['value'].mean() if (sum(x['weight'])) == 0 else sum(x['value']*x['weight'])/sum(x['weight']))
However, I think this type of one liners reduce the readability of the code.