python pandas dataframe group-by aggregate

Calculate the weighted average of a pandas DataFrame column inside a function

I have a function in which I would like to calculate the weighted average of the column other_col, weighted by the column amount, for each id. Outside of a function this works, but inside one I am not sure how to pass the DataFrame to the aggregation: the code below raises NameError: name 'df1' is not defined.

import numpy as np
import pandas as pd

def weighted_mean(x):
    try:
        # df1 is not defined in this scope, hence the NameError
        return np.average(x, weights=df1.loc[x.index, 'amount']) > 0.5
    except ZeroDivisionError:
        return 0

def some_function(df1=None):
    df1 = df1.groupby('id').agg(xx=('amount', lambda x: x.sum() > 100),
                                yy=('other_col', weighted_mean)).reset_index()
    return df1

df2 = pd.DataFrame({'id':[1,1,2,2,3], 'amount':[10, 200, 1, 10, 150], 'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})
df2 = some_function(df1=df2)

The expected output is:

   id     xx     yy
0   1   True   True
1   2  False  False
2   3   True  False

Solution

Your fundamental issue is that groupby.agg calls the aggregation function on one column at a time, while a weighted average needs two columns (other_col and amount). The only way to make this work with agg is for the function to reach outside its arguments to a DataFrame in an enclosing or global scope. That ties the function to one particular DataFrame, so it can no longer be generic.

# the function is hardcoded to use df2
# this makes it non generic
def weighted_mean(x):
    try: 
        return np.average(x, weights=df2.loc[x.index, 'amount']) > 0.5
    except ZeroDivisionError:
        return 0
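
To make the limitation concrete, here is a small sketch showing what agg actually hands to the aggregation function. The names probe and seen are hypothetical and exist only for this illustration; the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                   'amount': [10, 200, 1, 10, 150],
                   'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

seen = []  # records what the aggregation function receives (illustration only)

def probe(x):
    # x holds only the 'other_col' values of one group;
    # the 'amount' column cannot be reached through x itself
    seen.append((type(x).__name__, list(x.index)))
    return x.mean()

result = df.groupby('id').agg(yy=('other_col', probe))

for kind, index in seen:
    print(kind, index)
```

Each call receives a Series rather than a DataFrame. The row index is preserved, which is why looking up the weights with x.index in an outer DataFrame works at all.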

Instead, use groupby.apply, which passes each group to the function as a DataFrame, and rewrite your function to take a DataFrame as input:

def weighted_mean(df):
    try:
        return np.average(df['other_col'], weights=df['amount']) > 0.5
    except ZeroDivisionError:
        # np.average raises ZeroDivisionError when the weights sum to zero
        return 0

def some_function(df=None):
    def inner(g):
        # g is the sub-DataFrame for one id, so both columns are available
        return pd.Series({
            'xx': g['amount'].sum() > 100,
            'yy': weighted_mean(g),
        })

    return df.groupby('id', as_index=False).apply(inner)

df2 = pd.DataFrame({'id':[1,1,2,2,3],
                    'amount':[10, 200, 1, 10, 150],
                    'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})

out = some_function(df=df2)
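
Recent pandas releases (2.2 and later, as far as I know) emit a deprecation warning when groupby.apply passes the grouping column to the function. One possible way around it, sketched here on the assumption that only amount and other_col are needed inside each group, is to select those columns explicitly and restore id with reset_index. Returning False instead of 0 in the fallback is my own choice, made to keep the column purely boolean.

```python
import numpy as np
import pandas as pd

def weighted_mean(g):
    try:
        return np.average(g['other_col'], weights=g['amount']) > 0.5
    except ZeroDivisionError:
        # weights sum to zero: treat the group as "not above the threshold"
        return False

def some_function(df):
    def inner(g):
        # g contains only the selected columns of one group
        return pd.Series({'xx': g['amount'].sum() > 100,
                          'yy': weighted_mean(g)})

    return (df.groupby('id')[['amount', 'other_col']]
              .apply(inner)
              .reset_index())

df2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                    'amount': [10, 200, 1, 10, 150],
                    'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

out = some_function(df2)
print(out)
```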

Alternatively, define weighted_mean as an inner function of some_function, so that it uses the df argument of the enclosing function rather than a global variable:

def some_function(df=None):
    def weighted_mean(x):
        try:
            # df comes from the enclosing scope; x.index selects the matching weights
            return np.average(x, weights=df.loc[x.index, 'amount']) > 0.5
        except ZeroDivisionError:
            return 0

    return (df.groupby('id')
              .agg(xx=('amount', lambda x: x.sum() > 100),
                   yy=('other_col', weighted_mean))
              .reset_index())

df2 = pd.DataFrame({'id':[1,1,2,2,3],
                    'amount':[10, 200, 1, 10, 150],
                    'other_col':[0.1, 0.6, 0.7, 0.2, 0.4]})

out = some_function(df=df2)

Output:

   id     xx     yy
0   1   True   True
1   2  False  False
2   3   True  False
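
If calling a Python function once per group becomes a bottleneck, the same result can probably be obtained with built-in aggregations only. This is a sketch of a different technique from the two above, not a drop-in replacement: the helper column weighted is a name I made up, and groups whose weights sum to zero yield NaN (or inf) instead of going through the ZeroDivisionError branch. For non-negative weights that NaN compares as False.

```python
import pandas as pd

def some_function(df):
    # 'weighted' is a hypothetical helper column: amount * other_col per row
    sums = (df.assign(weighted=df['amount'] * df['other_col'])
              .groupby('id')[['amount', 'weighted']]
              .sum())
    # weighted average per id = sum(amount * other_col) / sum(amount)
    wavg = sums['weighted'] / sums['amount']
    return pd.DataFrame({'xx': sums['amount'] > 100,
                         'yy': wavg > 0.5}).reset_index()

df2 = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                    'amount': [10, 200, 1, 10, 150],
                    'other_col': [0.1, 0.6, 0.7, 0.2, 0.4]})

out = some_function(df2)
print(out)
```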