Search code examples
pythonpandasweighted-average

Calculate weighted average for multiple columns with NaN values grouped by index in Python


I'm trying to calculate a simple weighted average over multiple columns at once grouped by an index column with some of the column values being NaN.

Below is a sample dataset:

df = pd.DataFrame(np.random.choice([0,1,np.nan], size=(5,5)), columns=list('ABCDE'))
df['F'] = 'XYZ'
df['weight'] = 5 + np.random.sample(5) * 5
A   B   C   D   E   F   weight
0.0 0.0 0.0 NaN NaN XYZ 7.754209
0.0 1.0 1.0 0.0 0.0 XYZ 5.811653
0.0 NaN 1.0 0.0 1.0 XYZ 7.858809
1.0 0.0 1.0 0.0 1.0 XYZ 7.690689
NaN 1.0 0.0 0.0 0.0 XYZ 5.092012

And below is my attempt to calculate the weighted average by excluding the NaN values from both the columns as well as weights.

def weighted_avg(df, index_col, weight_col):
    cols = [c for c in df.columns if c not in ['weight_col', 'index_col']]
    df.loc[:, cols] = df.loc[:, cols].mul(df[weight_col], axis=0)
    agg = df.groupby(df[index_col]).sum()
    agg.loc[:, cols] = agg.loc[:, cols].div(agg[weight_col], axis=0)
    return agg

weighted_avg(df, 'F', 'weight')

However, I'm getting this error:

TypeError: can't multiply sequence by non-int of type 'float'

I checked the data types and all columns except F are float.

This is the expected output:

F   A           B           C           D   E
XYZ 0.26414542  0.413823896 0.624460453 0   0.587812429

I'm relatively new to Python and did find similar problems while extensively searching SO but still couldn't make this simple solution work.

Any help would be much appreciated. TIA.


Solution

  • Simplified solution

    x = df.drop(columns=['F', 'weight']) # x values
    w = x.notna().mul(df['weight'], axis=0) # weights excluding nulls
    
    wx = w * x # weights * x values
    avg = wx.groupby(df['F']).sum() / w.groupby(df['F']).sum() # sum(w * x) / sum(w)
    

    Explained

    Drop the index and weight columns to get x values

    # x
         A    B    C    D    E
    0  0.0  0.0  0.0  NaN  NaN
    1  0.0  1.0  1.0  0.0  0.0
    2  0.0  NaN  1.0  0.0  1.0
    3  1.0  0.0  1.0  0.0  1.0
    4  NaN  1.0  0.0  0.0  0.0
    

    Create a boolean mask using notna then multiply by weights along axis 0 to project the weights values to each column

    # w
              A         B         C         D         E
    0  7.754209  7.754209  7.754209  0.000000  0.000000
    1  5.811653  5.811653  5.811653  5.811653  5.811653
    2  7.858809  0.000000  7.858809  7.858809  7.858809
    3  7.690689  7.690689  7.690689  7.690689  7.690689
    4  0.000000  5.092012  5.092012  5.092012  5.092012
    

    Multiple the x values by weights w

    # wx
              A         B         C    D         E
    0  0.000000  0.000000  0.000000  NaN       NaN
    1  0.000000  5.811653  5.811653  0.0  0.000000
    2  0.000000       NaN  7.858809  0.0  7.858809
    3  7.690689  0.000000  7.690689  0.0  7.690689
    4       NaN  5.092012  0.000000  0.0  0.000000
    

    Group the wx and w dataframe by index column F and aggregate with sum

    # wx.groupby(df['F']).sum()
                A          B          C    D          E
    F                                                  
    XYZ  7.690689  10.903665  21.361151  0.0  15.549498
    
    
    # w.groupby(df['F']).sum()
                A          B          C          D          E
    F                                                        
    XYZ  29.11536  26.348563  34.207372  26.453163  26.453163
    

    Divide the aggregated sums to calculate weighted average

    # avg
                A         B        C    D         E
    F                                              
    XYZ  0.264145  0.413824  0.62446  0.0  0.587812