Search code examples
pythonpandasaggregateweighted-average

weighted average aggregation on multiple columns of df


I'm trying to calculate a weighted average for multiple columns in a dataframe. This is a sample of my data

Group Year Month Weight(kg) Nitrogen Calcium
A 2020 01 10000 10 70
A 2020 01 15000 4 78
A 2021 05 12000 5 66
A 2021 05 10000 8 54
B 2021 08 14000 10 90
C 2021 08 50000 20 92
C 2021 08 40000 10 95

My desired result would look something like this: enter image description here

What I've tried: I can get the correct weighted average values for a single column using this function: (similar to: link)

def wavg(df, value, weight):
    d = df[value]
    w = df[weight]
    try:
        return (d * w).sum() / w.sum()
    except ZeroDivisionError:
        return d.mean()

I can apply this function to a single column of my df:

df2 = df.groupby(["Group", "year", "month"]).apply(wavg, "Calcium", "Weight(kg").to_frame()

(Don't mind the different values, they are correct for the data in my notebook)

enter image description here

The obvious problem is that this function only works for a single column whilst I have a douzens of columns. I therefore tried a for loop:

column_list=[]
for column in df.columns:
  column_list.append(df.groupby(["Group", "year", "month"]).apply(wavg, column, "Weight(kg").to_frame())

It calculates the values correctly, but the columns are placed on top of each other instead of next to eachother. They also miss a usefull column name:

enter image description here

How could I adapt my code to return the desired df?


Solution

  • Change function for working by multiple columns and for avoid removing column for grouping are converting to MultiIndex:

    def wavg(x, value, weight):
        d = x[value]
        w = x[weight]
        try:
            return (d.mul(w, axis=0)).div(w.sum())
        except ZeroDivisionError:
            return d.mean()
    
    #columns used for groupby
    groups = ["Group", "Year", "Month"]
    #processing all another columns
    cols = df.columns.difference(groups + ["Weight(kg)"], sort=False)
    
    #create index and processing all columns by variable cols
    df1 = (df.set_index(groups)
             .groupby(level=groups)
             .apply(wavg, cols, "Weight(kg)")
             .reset_index())
    print (df2)
      Group  Year  Month    Calcium   Nitrogen
    0     A  2020      1  28.000000   4.000000
    1     A  2020      1  46.800000   2.400000
    2     A  2021      5  36.000000   2.727273
    3     A  2021      5  24.545455   3.636364
    4     B  2021      8  90.000000  10.000000
    5     C  2021      8  51.111111  11.111111
    6     C  2021      8  42.222222   4.444444