Search code examples
python-polars

Cum Max Sum in Polars


In the following dataframe, I want to populate the column cum_max_sum using the columns minimum and to_add. How can I achieve this using Polars operations?

┌──────────────────┬─────────┬────────┬─────────────┐
│ prev_cum_max_sum ┆ minimum ┆ to_add ┆ cum_max_sum │
│ ---              ┆ ---     ┆ ---    ┆ ---         │
│ i64              ┆ i64     ┆ i64    ┆ i64         │
╞══════════════════╪═════════╪════════╪═════════════╡
│ null             ┆ 1       ┆ 5      ┆ 6           │
│ 6                ┆ 4       ┆ 6      ┆ 12          │
│ 12               ┆ 13      ┆ 7      ┆ 20          │
└──────────────────┴─────────┴────────┴─────────────┘

Explanation:

max(null, 1) + 5 = 6

max(6, 4) + 6 = 12

max(12, 13) + 7 = 20


Solution

  • polars is vectorized so you can't really condition the next value on the previous value except with clever uses of cum___ functions or maybe dynamic groupbys.

    You can take advantage of numba's ability to make vectorized ufuncs and polars's ability to use ufuncs seamlessly.

    import numba as nb
    
    @nb.guvectorize([(nb.int64[:], nb.int64[:], nb.int64[:])], '(n),(n)->(n)', nopython=True)
    def g(x,y, res):
        res[0]=x[0]+y[0]
        for i in range(1, len(x)):
            if res[i-1]>x[i]:
                res[i]=res[i-1]+y[i]
            else:
                res[i]=x[i]+y[i]
    

    Now you have a ufunc called g which can do what you need.

    Let's make a bigger df and try it out.

    import polars as pl
    import numpy as np
    
    np.random.seed(10)
    df=pl.DataFrame({'minimum':np.random.randint(0,50,100),
                     'to_add':np.random.randint(0,5,100)})
    print(df.with_columns(cum_max_sum=g(pl.col('minimum'), pl.col('to_add')))[0:8])
    shape: (8, 3)
    ┌─────────┬────────┬─────────────┐
    │ minimum ┆ to_add ┆ cum_max_sum │
    │ ---     ┆ ---    ┆ ---         │
    │ i64     ┆ i64    ┆ i64         │
    ╞═════════╪════════╪═════════════╡
    │ 9       ┆ 0      ┆ 9           │
    │ 36      ┆ 1      ┆ 37          │
    │ 15      ┆ 1      ┆ 38          │
    │ 0       ┆ 0      ┆ 38          │
    │ 49      ┆ 2      ┆ 51          │
    │ 28      ┆ 3      ┆ 54          │
    │ 25      ┆ 0      ┆ 54          │
    │ 29      ┆ 4      ┆ 58          │
    └─────────┴────────┴─────────────┘
    

    There are a couple gotchas that this avoids:

    1. anything with strings won't work with numba because of the different ways each library treats them. (I might be wrong here so I'd be happy to be corrected)
    2. If you create a ufunc with more than 2 inputs then you have to use structs to input them (see my linked github issue)

    Additionally, you can wrap that g into an expression which you then monkey patch to the pl.Expr namespace if you need to use it a lot, like this:

    def gexpr(self, to_add):
        if isinstance(to_add, str):
            to_add=pl.col(to_add)
        return g(self, to_add)
    pl.Expr.g=gexpr
    print(df.with_columns(cum_max_sum=pl.col('minimum').g('to_add'))[0:8])