Tags: python, dataframe, vectorization, python-polars, rolling-computation

Python (Polars): Vectorized operation of determining current solution with the use of previous variables


Let's say we have 3 variables a, b & c.

There are n instances of each, and all but the first instance of c are null.

We are to calculate each next c from a given formula whose right-hand side uses only the values available at that point:

c = [(1 + a) * (current_c) * (b)] + [(1 + b) * (current_c) * (a)]
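
To pin down the intended recurrence, here is a plain-Python sketch of the same calculation, i.e. exactly the kind of loop the question wants to avoid. It assumes each row's c is computed from that row's a and b and the previous row's c, with the first row seeded by an initial value (3, as in the reference code further down). The function name `reference_c` is made up for illustration.

```python
def reference_c(a, b, initial_c):
    """Loop baseline: c[i] = (1 + a[i]) * c[i-1] * b[i] + (1 + b[i]) * c[i-1] * a[i]."""
    result = []
    prev_c = initial_c  # assumed seed used for the first row
    for a_i, b_i in zip(a, b):
        # Both terms reuse the previous c; the new value then becomes "current".
        prev_c = (1 + a_i) * prev_c * b_i + (1 + b_i) * prev_c * a_i
        result.append(prev_c)
    return result


expected = reference_c([2, 3, 4, 5, 8], [3, 7, 4, 9, 2], initial_c=3)
print(expected)  # [51, 2652, 106080, 11032320, 463357440]
```

Any vectorized approach should reproduce these values for the sample data.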

How do we go about this calculation without using native python looping? I've tried:

  • pl.int_range(my_index_column_value, pl.len() + 1) (my index starts from 1)
  • pl.rolling(...) (this seems to be quite an expensive operation)
  • pl.when(...).then(...).otherwise(...) with the above two along with .over(...) & pl.select(...).item()

to no avail. In every case, the shift is applied to the whole column at once instead of row by row. I thought the most plausible way to do this would be either rolling by 1 with grouping by 2, or using pl.int_range(...) with the current index column value as the shift amount. However, these keep failing because I can't come up with the correct syntax: I'm unable to pass the index column value and have Polars accept it as a number, and even casting throws the same errors. Right now I'm thinking we could manage another row for shifting and pass values back to row c, but I'm not sure that would even be an efficient way to go about it.

What would be the most optimal way to go about this without offloading to Rust?

Code for reference:

import polars as pl

if __name__ == "__main__":
    initial_c_value = 3

    df = pl.DataFrame(((2, 3, 4, 5, 8), (3, 7, 4, 9, 2)), schema=('a', 'b'))
    df = df.with_row_index('i', 1).with_columns(pl.lit(None).alias('c'))

    df = df.with_columns(pl.when(pl.col('i') == 1)
    .then(
        (((1 + pl.col('a')) * (initial_c_value) * (pl.col('b'))) +
        ((1 + pl.col('b')) * (initial_c_value) * (pl.col('a')))).alias('c'))
    .otherwise(
        ((1 + pl.col('a')) * (pl.col('c').shift(1)) * (pl.col('b'))) +
        ((1 + pl.col('b')) * (pl.col('c').shift(1)) * (pl.col('a')))).shift(1).alias('c'))

    print(df)

Solution

  • Using numba you can make ufuncs which polars can use seamlessly.

    from numba import guvectorize, int64
    import polars as pl
    
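    # Generalized ufunc: a and b are 1-D int64 arrays of length n, init_c is a scalar,
    # and res is the length-n output array, passed in as the last argument.
    @guvectorize([(int64[:], int64[:], int64, int64[:])], '(n),(n),()->(n)', nopython=True)
    def make_c(a, b, init_c, res):
        # Seed the recurrence with the initial c value.
        res[0] = (1 + a[0]) * init_c * b[0] + (1 + b[0]) * init_c * a[0]
        # Every later element is built from the previous result.
        for i in range(1, a.shape[0]):
            res[i] = (1 + a[i]) * res[i - 1] * b[i] + (1 + b[i]) * res[i - 1] * a[i]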
            
    df = pl.DataFrame(((2, 3, 4, 5, 8), (3, 7, 4, 9, 2)), schema=('a', 'b'))
    
    df.with_columns(
        c=make_c(pl.col('a'), pl.col('b'), 3)
    )

    Output:

    shape: (5, 3)
    ┌─────┬─────┬───────────┐
    │ a   ┆ b   ┆ c         │
    │ --- ┆ --- ┆ ---       │
    │ i64 ┆ i64 ┆ i64       │
    ╞═════╪═════╪═══════════╡
    │ 2   ┆ 3   ┆ 51        │
    │ 3   ┆ 7   ┆ 2652      │
    │ 4   ┆ 4   ┆ 106080    │
    │ 5   ┆ 9   ┆ 11032320  │
    │ 8   ┆ 2   ┆ 463357440 │
    └─────┴─────┴───────────┘
    

    The way it works is that the ufunc detects that its input is a Polars Expr (i.e. pl.col() is an Expr) and then hands control to Polars. Because of that, you cannot just call make_c('a', 'b', 3): the inputs would then be plain str values, and the ufunc won't know what to do with them.