python, python-polars

Cumulative calculation across rows?


Suppose I have a function:

def f(prev, curr):
  return prev * 2 + curr

(Just an example, could have been anything)

And a Polars dataframe:

| some_col | other_col |
|----------|-----------|
|    7     |    ...
|    3     |
|    9     |
|    2     |

I would like to use f on my dataframe cumulatively, and the output would be:

| some_col | other_col |
|----------|-----------|
|    7     |    ...
|    17    |
|    43    |
|    88    |

I understand that, naturally, this type of calculation isn't going to be very efficient since it has to be done one row at a time (at least in the general case).

I can obviously loop over rows. But is there an elegant, idiomatic way to do this in Polars?


Solution

  • It depends on the exact operation you need to perform.

    The example you've given can be expressed in terms of .cum_sum() with additional arithmetic:

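    import polars as pl

    def plus_prev_times_2(col):
        # Weights for rows 1..n-1: 2^(n-2), ..., 2, 1
        x = 2 ** pl.int_range(pl.len() - 1).reverse()
        # Multipliers for the first row's contribution: 2, 4, ..., 2^(n-1)
        y = 2 ** pl.int_range(1, pl.len())
        # Running sum of the weighted rows (first row excluded)
        cs = (x * col.slice(1)).cum_sum()
        # Rescale each running sum, then add the first row's share
        return cs / x + col.first() * y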
    
    df = pl.DataFrame({"some_col": [7, 3, 9, 2]})
    
    df.with_columns(
       pl.col.some_col.first()
         .append(pl.col.some_col.pipe(plus_prev_times_2))
         .alias("plus_prev_times_2")
    )     
    
    shape: (4, 2)
    ┌──────────┬───────────────────┐
    │ some_col ┆ plus_prev_times_2 │
    │ ---      ┆ ---               │
    │ i64      ┆ f64               │
    ╞══════════╪═══════════════════╡
    │ 7        ┆ 7.0               │
    │ 3        ┆ 17.0              │
    │ 9        ┆ 43.0              │
    │ 2        ┆ 88.0              │
    └──────────┴───────────────────┘
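Why the cum_sum trick works: unrolling f gives result[i] = 2^i * col[0] + sum over k = 1..i of 2^(i-k) * col[k], so a weighted running sum divided by the right power of 2 reproduces the recurrence. The sketch below checks that identity in plain Python against a row-by-row loop. The names scan_loop and scan_closed_form are illustrative only (not Polars functions), and the input is assumed to be a non-empty list of integers.

```python
from itertools import accumulate

def scan_loop(values):
    # Reference: apply prev * 2 + curr one row at a time.
    out = [values[0]]
    for curr in values[1:]:
        out.append(out[-1] * 2 + curr)
    return out

def scan_closed_form(values):
    # Mirrors the arithmetic of plus_prev_times_2 using Python lists.
    n = len(values)
    first, rest = values[0], values[1:]
    x = [2 ** k for k in reversed(range(n - 1))]  # 2^(n-2), ..., 2, 1
    y = [2 ** k for k in range(1, n)]             # 2, 4, ..., 2^(n-1)
    # Running sum of the weighted rows (first row excluded).
    cs = list(accumulate(xi * c for xi, c in zip(x, rest)))
    # Each running sum is an exact multiple of its weight, so // is exact.
    return [first] + [s // xi + first * yi for s, xi, yi in zip(cs, x, y)]

print(scan_loop([7, 3, 9, 2]))         # [7, 17, 43, 88]
print(scan_closed_form([7, 3, 9, 2]))  # [7, 17, 43, 88]
```

Note that the weights grow as 2^n, so this rewrite is presumably only practical for fairly short columns.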
    

    Vertical fold/scan

    In general, I believe what you're asking for is called a "vertical fold/scan".

    Polars only offers a horizontal version, pl.cum_fold:

    df = pl.DataFrame(dict(a=[7], b=[3], c=[9], d=[2]))
    
    df.with_columns(
       pl.cum_fold(acc=0, function=lambda acc, x: acc * 2 + x, exprs=pl.all())
    )
    
    shape: (1, 5)
    ┌─────┬─────┬─────┬─────┬──────────────┐
    │ a   ┆ b   ┆ c   ┆ d   ┆ cum_fold     │
    │ --- ┆ --- ┆ --- ┆ --- ┆ ---          │
    │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ struct[4]    │
    ╞═════╪═════╪═════╪═════╪══════════════╡
    │ 7   ┆ 3   ┆ 9   ┆ 2   ┆ {7,17,43,88} │
    └─────┴─────┴─────┴─────┴──────────────┘
    

    As discussed in the issue, a vertical equivalent would be hugely inefficient.
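For reference, the general row-at-a-time scan can be sketched outside Polars with itertools.accumulate from the Python standard library. This is only a baseline, assuming the column fits comfortably in a Python list. vertical_scan is a hypothetical helper name, not a Polars function.

```python
from itertools import accumulate

def f(prev, curr):
    return prev * 2 + curr

def vertical_scan(values, func):
    """Running fold down a column.

    out[0] = values[0], and out[i] = func(out[i - 1], values[i]) for i >= 1.
    """
    return list(accumulate(values, func))

print(vertical_scan([7, 3, 9, 2], f))  # [7, 17, 43, 88]
```

With a Polars column, the values could be pulled out with df["some_col"].to_list() and the result wrapped back up with pl.Series(...). Everything then runs at Python speed, one row at a time, which is the inefficiency mentioned above.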

    For an efficient approach, you can write plugins in Rust, but using something like numba is probably easier to implement.

    There are several existing numba answers, e.g.