Search code examples
python-polars

Polars equivalent of Pandas groupby transform


Is there a way to replicate the groupby -> transform functionality in Polars?

I use this for normalising over groups (make groups sum to one), i.e.

df['normalised'] = df.groupby(groupcols).transform(lambda x: x/x.sum())

In polars I tried a workaround by adding a column for the sum then dividing by that column:

df.join(
    df.group_by(groupcols).agg(pl.col('VOL').sum().alias('VOLSUM'))),
    left_on = [groupcols],
    right_on = [groupcols]
).with_columns(
    (pl.col('VOL') / pl.col('VOLSUM')).alias('VOLNORM')
).drop('VOLSUM')

However, I also need need to compute a cummulative sum within groups:

df['cummulativesum'] = df.groupby(groupcols)[col].transform('cumsum')

Hence, I have two questions:

  1. Is my normalisation over groups the best way?
  2. Is there a way to do a polars equivalent of cumsum within groups?

Thanks in advance!


Solution

  • You can use Window functions

    df = pl.DataFrame(dict(GROUP=[1, 2, 2, 1, 3], VOL=[1, 2, 1, 5, 1]))
    
    shape: (5, 2)
    ┌───────┬─────┐
    │ GROUP ┆ VOL │
    │ ---   ┆ --- │
    │ i64   ┆ i64 │
    ╞═══════╪═════╡
    │ 1     ┆ 1   │
    │ 2     ┆ 2   │
    │ 2     ┆ 1   │
    │ 1     ┆ 5   │
    │ 3     ┆ 1   │
    └───────┴─────┘
    

    .with_columns() and .over():

    df.with_columns(
       VOLNORM = (pl.col.VOL / pl.col.VOL.sum()).over("GROUP"),
       VOLCUMSUM = pl.col.VOL.cum_sum().over("GROUP")
    )
    
    shape: (5, 4)
    ┌───────┬─────┬──────────┬───────────┐
    │ GROUP ┆ VOL ┆ VOLNORM  ┆ VOLCUMSUM │
    │ ---   ┆ --- ┆ ---      ┆ ---       │
    │ i64   ┆ i64 ┆ f64      ┆ i64       │
    ╞═══════╪═════╪══════════╪═══════════╡
    │ 1     ┆ 1   ┆ 0.166667 ┆ 1         │
    │ 2     ┆ 2   ┆ 0.666667 ┆ 2         │
    │ 2     ┆ 1   ┆ 0.333333 ┆ 3         │
    │ 1     ┆ 5   ┆ 0.833333 ┆ 6         │
    │ 3     ┆ 1   ┆ 1.0      ┆ 1         │
    └───────┴─────┴──────────┴───────────┘