Cumulative calculation across rows?

Suppose I have a function:

def f(prev, curr):
  return prev * 2 + curr

(This is just an example; f could be any function of the previous result and the current value.)

And a Polars dataframe:

| some_col | other_col |
|----------|-----------|
|    7     |    ...    |
|    3     |           |
|    9     |           |
|    2     |           |

I would like to apply f cumulatively down some_col, so that each output value is f(previous output, current value) and the first value is kept as-is. The output would be:

| some_col | other_col |
|----------|-----------|
|    7     |    ...    |
|    17    |           |
|    43    |           |
|    88    |           |

I understand that, naturally, this type of calculation isn't going to be very efficient since it has to be done one row at a time (at least in the general case).

I can obviously loop over rows. But is there an elegant, idiomatic way to do this in Polars?

Solution

It depends on the exact operation you need to perform.

The example you've given can be expressed in terms of .cum_sum() with additional arithmetic:

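import polars as pl

def plus_prev_times_2(col):
    # Weights 2^(n-2), ..., 2^0 for rows 1..n-1
    x = 2 ** pl.int_range(pl.len() - 1).reverse()
    # Powers 2^1, ..., 2^(n-1) applied to the first value
    y = 2 ** pl.int_range(1, pl.len())
    cs = (x * col.slice(1)).cum_sum()
    return cs / x + col.first() * y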

df = pl.DataFrame({"some_col": [7, 3, 9, 2]})

df.with_columns(
    pl.col.some_col.first()
        .append(pl.col.some_col.pipe(plus_prev_times_2))
        .alias("plus_prev_times_2")
)

shape: (4, 2)
┌──────────┬───────────────────┐
│ some_col ┆ plus_prev_times_2 │
│ ---      ┆ ---               │
│ i64      ┆ f64               │
╞══════════╪═══════════════════╡
│ 7        ┆ 7.0               │
│ 3        ┆ 17.0              │
│ 9        ┆ 43.0              │
│ 2        ┆ 88.0              │
└──────────┴───────────────────┘
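
To see why this works, it may help to unroll the recurrence: row i equals col[0] * 2**i plus the sum of col[k] * 2**(i-k) for k from 1 to i. The sketch below checks that identity in plain Python against a direct scan; the names weights, running and closed are only illustrative and mirror x, cs and the returned expression above.

```python
from itertools import accumulate

def f(prev, curr):
    return prev * 2 + curr

col = [7, 3, 9, 2]
n = len(col)

# Reference result: apply f cumulatively, keeping the first value as-is.
expected = list(accumulate(col, f))

# Weights 2^(n-2), ..., 2^0 for rows 1..n-1 (the `x` of the Polars version).
weights = [2 ** (n - 1 - k) for k in range(1, n)]

# Running sum of the weighted values (the `cs` of the Polars version).
running = list(accumulate(w * c for w, c in zip(weights, col[1:])))

# Undo the weighting per row and add the contribution of the first value.
closed = [col[0]] + [
    running[i - 1] // weights[i - 1] + col[0] * 2 ** i
    for i in range(1, n)
]

assert closed == expected
```

Note that the powers of two grow with the number of rows, so this trick is probably only practical for fairly short columns.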

Vertical fold/scan

In general, I believe what you're asking for is called a "vertical fold/scan":

https://github.com/pola-rs/polars/issues/12165

Polars only offers a horizontal version, pl.cum_fold:

df = pl.DataFrame(dict(a=[7], b=[3], c=[9], d=[2]))

df.with_columns(
    pl.cum_fold(acc=0, function=lambda acc, x: acc * 2 + x, exprs=pl.all())
)

shape: (1, 5)
┌─────┬─────┬─────┬─────┬──────────────┐
│ a   ┆ b   ┆ c   ┆ d   ┆ cum_fold     │
│ --- ┆ --- ┆ --- ┆ --- ┆ ---          │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ struct[4]    │
╞═════╪═════╪═════╪═════╪══════════════╡
│ 7   ┆ 3   ┆ 9   ┆ 2   ┆ {7,17,43,88} │
└─────┴─────┴─────┴─────┴──────────────┘

As discussed in the issue, a vertical equivalent would be hugely inefficient.
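
If speed is not a concern (small frames, one-off scripts), a plain Python scan can still be wrapped in map_batches. This is only a sketch of a fallback, not a built-in vertical scan: vertical_scan is a hypothetical helper name, and it runs f once per row in Python, so it will be slow on large columns.

```python
from itertools import accumulate

import polars as pl

def f(prev, curr):
    return prev * 2 + curr

def vertical_scan(s: pl.Series) -> pl.Series:
    # accumulate() keeps the first value and then applies f(prev, curr)
    # row by row; this is a pure-Python loop under the hood.
    return pl.Series(s.name, list(accumulate(s.to_list(), f)), dtype=s.dtype)

df = pl.DataFrame({"some_col": [7, 3, 9, 2]})

out = df.with_columns(
    pl.col("some_col")
    .map_batches(vertical_scan, return_dtype=pl.Int64)
    .alias("scanned")
)
print(out)
```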

For an efficient approach, you can write plugins in Rust:

https://marcogorelli.github.io/polars-plugins-tutorial/cum_sum/

But using something like numba is probably easier to implement.

There are several existing numba answers, e.g.

Python (Polars): Vectorized operation of determining current solution with the use of previous variables