In the following dataframe, I want to populate the column cum_max_sum
using the columns minimum
and to_add
. How can I achieve this using Polars operations?
┌──────────────────┬─────────┬────────┬─────────────┐
│ prev_cum_max_sum ┆ minimum ┆ to_add ┆ cum_max_sum │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════════════════╪═════════╪════════╪═════════════╡
│ null ┆ 1 ┆ 5 ┆ 6 │
│ 6 ┆ 4 ┆ 6 ┆ 12 │
│ 12 ┆ 13 ┆ 7 ┆ 20 │
└──────────────────┴─────────┴────────┴─────────────┘
Explanation:
max(null, 1) + 5 = 6
max(6, 4) + 6 = 12
max(12, 13) + 7 = 20
polars is vectorized so you can't really condition the next value on the previous value except with clever uses of cum___
functions or maybe dynamic groupbys.
You can take advantage of numba's ability to make vectorized ufuncs and polars's ability to use ufuncs seamlessly.
import numba as nb
@nb.guvectorize([(nb.int64[:], nb.int64[:], nb.int64[:])], '(n),(n)->(n)', nopython=True)
def g(x,y, res):
res[0]=x[0]+y[0]
for i in range(1, len(x)):
if res[i-1]>x[i]:
res[i]=res[i-1]+y[i]
else:
res[i]=x[i]+y[i]
Now you have a ufunc called g
which can do what you need.
Let's make a bigger df and try it out.
import polars as pl
import numpy as np
np.random.seed(10)
df=pl.DataFrame({'minimum':np.random.randint(0,50,100),
'to_add':np.random.randint(0,5,100)})
print(df.with_columns(cum_max_sum=g(pl.col('minimum'), pl.col('to_add')))[0:8])
shape: (8, 3)
┌─────────┬────────┬─────────────┐
│ minimum ┆ to_add ┆ cum_max_sum │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════════╪════════╪═════════════╡
│ 9 ┆ 0 ┆ 9 │
│ 36 ┆ 1 ┆ 37 │
│ 15 ┆ 1 ┆ 38 │
│ 0 ┆ 0 ┆ 38 │
│ 49 ┆ 2 ┆ 51 │
│ 28 ┆ 3 ┆ 54 │
│ 25 ┆ 0 ┆ 54 │
│ 29 ┆ 4 ┆ 58 │
└─────────┴────────┴─────────────┘
There are a couple gotchas that this avoids:
Additionally, you can wrap that g
into an expression which you then monkey patch to the pl.Expr
namespace if you need to use it a lot, like this:
def gexpr(self, to_add):
if isinstance(to_add, str):
to_add=pl.col(to_add)
return g(self, to_add)
pl.Expr.g=gexpr
print(df.with_columns(cum_max_sum=pl.col('minimum').g('to_add'))[0:8])