Search code examples
group-bynullmeanpython-polarsweighted-average

polars groupby use equal weighted average when sum of weights is zero


I want to calculate the weighted average of a polars dataframe during a groupby operation. If the sum of the weights within the groupby equal zero I want to return the equally weighted mean. Additionally, I want to ignore null values in the columns to average or the weight columns.

Notation:

df (polars.DataFrame): input dataframe
gr (list[str]): columns to groupby
vr (list[str]): columns to take average of
wt (str): column to use as the weights

To calculate the equally weighted average I have

res = df.group_by(gr, maintain_order = True).agg(polars.col(vr).mean())

By default mean function ignores null values in the 'vr' columns, which is desired.

To calculate the weighted average I have

mask_wt = polars.col(wt) * polars.col(vr).is_not_null()
wavg = (polars.col(vr) * polars.col(wt)).sum() / mask_wt.sum()
res = df.group_by(gr, maintain_order = True).agg(wavg)

Masking the weights allows me to ignore null values in 'vr' columns and the 'wt' column.

For example,

import polars

df = polars.DataFrame({'id': ['x', 'x', 'y', 'y'], 
                       'dt': ['j', 'f', 'j', 'f'],
                       'a': [None, 2, 3, 4], 
                       'b': [5, 6, 7, 8], 
                       'wt': [1, 0, 0.25, 0.25]})

gr = ['id']
vr = ['a', 'b']
wt = 'wt'

>>> df
┌─────┬─────┬──────┬─────┬──────┐
│ id  ┆ dt  ┆ a    ┆ b   ┆ wt   │
│ --- ┆ --- ┆ ---  ┆ --- ┆ ---  │
│ str ┆ str ┆ i64  ┆ i64 ┆ f64  │
╞═════╪═════╪══════╪═════╪══════╡
│ x   ┆ j   ┆ null ┆ 5   ┆ 1.0  │
│ x   ┆ f   ┆ 2    ┆ 6   ┆ 0.0  │
│ y   ┆ j   ┆ 3    ┆ 7   ┆ 0.25 │
│ y   ┆ f   ┆ 4    ┆ 8   ┆ 0.25 │
└─────┴─────┴──────┴─────┴──────┘

>>> df.group_by(gr, maintain_order = True).agg(polars.col(vr).mean())
┌─────┬─────┬─────┐
│ id  ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ x   ┆ 2.0 ┆ 5.5 │
│ y   ┆ 3.5 ┆ 7.5 │
└─────┴─────┴─────┘

>>> mask_wt = polars.col(wt) * polars.col(vr).is_not_null()
>>> wavg = (polars.col(vr) * polars.col(wt)).sum() / mask_wt.sum()
>>> res = df.group_by(gr, maintain_order = True).agg(wavg)
┌─────┬─────┬─────┐
│ id  ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ x   ┆ NaN ┆ 5.0 │
│ y   ┆ 3.5 ┆ 7.5 │
└─────┴─────┴─────┘

I would like the NaN in the weighted average to be replaced with the 2 from the equally weighted average because when ignoring the weight for the null value in the column 'a' the sum of the weights is 0, so the equally weighted average should be returned.

Using pandas I can accomplish this by

import pandas

def _wavg_py(gr, vr, wt):
    x = gr[[vr, wt]].dropna()
    den = x[wt].sum()
    if(den == 0):
        return(gr[vr].mean())
    else:
        return((x[vr] * x[wt]).sum() / den)

res = df.groupby(by = gr).apply(_wavg_py, col, wt)

Solution

  • I'm not sure if it is an acceptable approach to .fill_nan() with the .mean()

    df.group_by(gr, maintain_order=True).agg(
        wavg.fill_nan(pl.col(vr).mean())
    )
    
    shape: (2, 3)
    ┌─────┬─────┬─────┐
    │ id  ┆ a   ┆ b   │
    │ --- ┆ --- ┆ --- │
    │ str ┆ f64 ┆ f64 │
    ╞═════╪═════╪═════╡
    │ x   ┆ 2.0 ┆ 5.0 │
    │ y   ┆ 3.5 ┆ 7.5 │
    └─────┴─────┴─────┘
    

    Alternatively, it seems you're asking how to perform conditional logic inside expressions

    i.e. when/then/otherwise

    df.group_by(gr, maintain_order=True).agg(
        pl.when(pl.col('wt').filter(pl.col(vr).is_not_null()).sum() == 0)
          .then(pl.col(vr).mean())
          .otherwise(wavg)
    )
    
    shape: (2, 3)
    ┌─────┬─────┬─────┐
    │ id  ┆ a   ┆ b   │
    │ --- ┆ --- ┆ --- │
    │ str ┆ f64 ┆ f64 │
    ╞═════╪═════╪═════╡
    │ x   ┆ 2.0 ┆ 5.0 │
    │ y   ┆ 3.5 ┆ 7.5 │
    └─────┴─────┴─────┘