Search code examples
pandasnumpypandas-groupbyweighted-average

nanmean with weights to calculate weighted average in pandas .agg


I'm using a lambda function in a pandas aggregation to calculate the weighted average. My issue is that if one of the values is nan, the whole result is nan of that group. How can I avoid this?

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],columns = ['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df.loc['b','four'] ='foo'
df.loc['c','four'] ='foo'


        one       two     three four   five found
a  1.046540 -0.304646 -0.982008  bar   True   NaN
b       NaN       NaN       NaN  foo    NaN   foo
c -1.086525  1.086501  0.403910  foo  False   NaN
d       NaN       NaN       NaN  NaN    NaN   NaN
e  0.569420  0.105422  0.192559  bar   True   NaN
f  0.384400 -0.558321  0.324624  bar   True   NaN
g       NaN       NaN       NaN  NaN    NaN   NaN
h  0.656231 -2.185062  0.180535  bar   True   NaN


df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', lambda x: np.average(x, weights=df.loc[x.index, 'two'])))

           sum  weighted_avg
four                        
bar  -2.942608      0.648173
foo   1.086501           NaN

desired result:

           sum  weighted_avg
four                        
bar  -2.942608      0.648173
foo   1.086501     -1.086525 

Unlike this question, this is not the problem that the actual value of the column does not appear, it's a problem of nanmean not having a weighting option.

Another numerical example:

     x      y
0   NaN   18.0
1   NaN   21.0
2   NaN   38.0
3  56.0  150.0
4  65.0  154.0

Here we would wnat to just return the weighted average of the two last rows and ignore the other rows that contain nan.


Solution

  • This appears to be more robust:

    def f(x):
        indices = (~np.isnan(x)) & (~np.isnan(df[weight_column]))[x.index]
        try:
            return np.average(x[indices], weights=df.loc[x.index[indices], weight_column])
        except ZeroDivisionError:
            return np.nan
    
    df = df.groupby('four').agg(sum=('two','sum'), weighted_avg=('one', f))