Tags: python, pandas, dataframe, group-by, aggregate

Calculate a weighted average of a pandas DataFrame column inside a function


I have a function in which I would like to calculate, for each id, the weighted average of the column other_col, weighted by the column amount. Outside of a function this works, but inside one I am not sure how to pass the DataFrame to the aggregation function. With the code below I get an error: NameError: name 'df1' is not defined.

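import numpy as np
import pandas as pd

def weighted_mean(x):
    try:
        # df1 only exists inside some_function, hence the NameError here
        return np.average(x, weights=df1.loc[x.index, 'amount']) > 0.5
    except ZeroDivisionError:
        return 0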

def some_function(df1=None):
    df1 = df1.groupby('id').agg(xx=('amount', lambda x: x.sum() > 100),
                                yy=('other_col', weighted_mean)).reset_index()
    return df1

df2 = pd.DataFrame({'id':[1,1,2,2,3], 'amount':[10, 200, 1, 10, 150], 'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})
df2 = some_function(df1=df2)

The result I expect is:

   id     xx     yy
0   1   True   True
1   2  False  False
2   3   True  False

Solution

  • Your fundamental issue is that you are using groupby.agg with a function that depends on more than one column. Each function passed to agg receives only the column it is paired with (one Series per group), so this cannot work unless the function reads the DataFrame from an outer scope. That ties the function to one specific variable name, so it cannot be generic.

    # the function is hardcoded to use df2
    # this makes it non generic
    def weighted_mean(x):
        try: 
            return np.average(x, weights=df2.loc[x.index, 'amount']) > 0.5
        except ZeroDivisionError:
            return 0
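
    To make the limitation concrete, here is a minimal sketch (not part of the original answer) that records what groupby.agg hands to the function. The names probe and seen are hypothetical helpers used only for illustration. Every call receives a Series that holds only other_col for one group; its index still carries the original row labels, which is why the workaround above has to look up df2.loc[x.index, 'amount'].

```python
import pandas as pd

df2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                    'amount': [10, 200, 1, 10, 150],
                    'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

seen = []  # hypothetical log of the arguments agg passes in

def probe(x):
    # record the argument's type and the original row labels it carries
    seen.append((type(x).__name__, list(x.index)))
    return x.sum()

df2.groupby('id').agg(total=('other_col', probe))

# each entry is a Series for one group; the 'amount' column is never passed
for kind, rows in seen:
    print(kind, rows)
```

    The number of calls can differ between pandas versions, so the sketch only inspects the kind of object received, not how often the function runs.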
    

    Instead, use groupby.apply and rewrite your function to take a DataFrame as input:

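    import numpy as np
    import pandas as pd

    def weighted_mean(df):
        # df is the sub-DataFrame of one group, so both columns are available
        try:
            return np.average(df['other_col'], weights=df['amount']) > 0.5
        except ZeroDivisionError:
            # np.average raises this when the weights sum to zero
            return False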
    
    def some_function(df=None):
        def inner(g):
            return pd.Series({
                'xx': g['amount'].sum()>100,
                'yy': weighted_mean(g),
            })
        
        return (df.groupby('id', as_index=False)
                  .apply(inner)
                )
    
    df2 = pd.DataFrame({'id':[1,1,2,2,3],
                        'amount':[10, 200, 1, 10, 150],
                        'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})
    
    out = some_function(df=df2)
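
    If the column names should not be hardcoded either, one possible variant (a sketch, not from the original answer) passes them as keyword arguments, which groupby.apply forwards to the function. The parameter names value_col and weight_col are assumptions made for this example, and returning NaN for zero total weight is a choice made here, not something the question specifies. Selecting the two columns explicitly before apply keeps the grouping column out of each group.

```python
import numpy as np
import pandas as pd

def weighted_mean(g, value_col, weight_col):
    # g is the sub-DataFrame of one group
    try:
        return np.average(g[value_col], weights=g[weight_col])
    except ZeroDivisionError:
        # weights sum to zero: no meaningful average for this group
        return np.nan

df2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                    'amount': [10, 200, 1, 10, 150],
                    'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

# extra keyword arguments are forwarded to weighted_mean for every group
wm = (df2.groupby('id')[['amount', 'other_col']]
         .apply(weighted_mean, value_col='other_col', weight_col='amount'))

# threshold afterwards, so the raw weighted means stay available in wm
yy = (wm > 0.5).rename('yy').reset_index()
```

    Keeping the threshold outside the function means the same helper can be reused where the raw weighted mean is needed.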
    

    Alternatively, define weighted_mean as an inner function of some_function, so that it reads the DataFrame from the enclosing scope rather than from a global variable:

    
    def some_function(df=None):
        def weighted_mean(x):
            try: 
                return np.average(x, weights=df.loc[x.index, 'amount']) > 0.5
            except ZeroDivisionError:
                return 0
       
        return (df.groupby('id')
                  .agg(xx=('amount', lambda x: x.sum() > 100),
                       yy=('other_col', weighted_mean))
                  .reset_index()
                )
    
    df2 = pd.DataFrame({'id':[1,1,2,2,3],
                        'amount':[10, 200, 1, 10, 150],
                        'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})
    
    out = some_function(df=df2)
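
    The same closure idea can be pulled out into a small factory, so the aggregation function is reusable with groupby.agg for any DataFrame. This is a sketch under assumptions, not the original answer's code: make_weighted_mean is a hypothetical name, and the lookup df.loc[x.index, weight_col] assumes the DataFrame has unique index labels.

```python
import numpy as np
import pandas as pd

def make_weighted_mean(df, weight_col):
    # returns a function usable in groupby.agg; df is captured by the closure
    def weighted_mean(x):
        try:
            # x.index holds the original row labels of the current group
            return np.average(x, weights=df.loc[x.index, weight_col])
        except ZeroDivisionError:
            return np.nan
    return weighted_mean

df2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                    'amount': [10, 200, 1, 10, 150],
                    'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

out = (df2.groupby('id')
          .agg(xx=('amount', lambda x: x.sum() > 100),
               yy=('other_col', make_weighted_mean(df2, 'amount')))
          .reset_index())

# apply the threshold after aggregating
out['yy'] = out['yy'] > 0.5
```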
    

    Output:

       id     xx     yy
    0   1   True   True
    1   2  False  False
    2   3   True  False
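
    For this particular aggregation, a different technique than the one in the answer also gives the same table: precompute the product of value and weight, then use only built-in group sums. A minimal sketch, where the helper column name weighted is an assumption:

```python
import pandas as pd

df2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                    'amount': [10, 200, 1, 10, 150],
                    'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

# weighted mean = sum(value * weight) / sum(weight), computed per group
tmp = df2.assign(weighted=df2['other_col'] * df2['amount'])
sums = tmp.groupby('id')[['amount', 'weighted']].sum()

out = (sums.assign(xx=sums['amount'] > 100,
                   yy=(sums['weighted'] / sums['amount']) > 0.5)
           [['xx', 'yy']]
           .reset_index())
```

    A group whose weights sum to zero produces NaN or an infinite value in the division, so the comparison should be checked for such groups if they can occur.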