Tags: numba, python-polars, python-c-api

Polars complex function via the numba JIT (circumventing the return entity limit of 1)


In a Polars DataFrame, I have 3 columns (A, B, D) storing values of type double. For n rows, starting from some initial double value, the calculation per row is:

value of column A = value of column A * initial value
value of column B = value of column B * initial value
value of column D = value of column A + value of column B
initial value = value of column D
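In plain Python terms, the per-row recurrence above can be sketched as follows (illustrative only; the function name and list handling are mine, not part of any Polars API):

```python
def recurrence(a, b, initial_value):
    """Sketch of the per-row update: scale A and B by the carry,
    set D to their sum, and feed D back in as the next carry."""
    a_out = list(a)
    b_out = list(b)
    d = [0.0] * len(a)

    carry = initial_value
    for i in range(len(a)):
        a_out[i] = a[i] * carry
        b_out[i] = b[i] * carry
        d[i] = a_out[i] + b_out[i]
        carry = d[i]

    return a_out, b_out, d
```

For example, with A = [1, 2, 3], B = [1, 1, 1] and an initial value of 1.0, this yields A = [1, 4, 18], B = [1, 2, 6] and D = [2, 6, 24].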

Unfortunately, as of Polars (v0.20.26), complex custom functions of this nature can only be performed satisfactorily via numba gufuncs.

My issue is that I can only return 1 element from the gufunc and still have it be compatible with Polars (Polars does not yet support multiple returns). So in the end, even though I get my final results for column D, the values for columns A & B are not retained.

One way to circumvent this would be to simply perform the same computations two more times, solving for each desired column's values in turn - but that is a rather inefficient way to go about it. I also tried to hoodwink Polars and numba by making the value of column D an array of [column A, column B, column D], but numba does not allow this with Polars columns: a message is returned saying the data was elided (which, I've noticed, is usually a signal of numba's incompatibility with the data). I've also tried turning the column information into a comma-joined string, to no avail - we simply can't pass complex data like that to numba, not as far as I know.

I wrote a custom C function, but for some very odd reason, Python is so abysmally slow at calling a C function from its native API that it takes about 100 times longer than the Polars complex function described above. I can only guess this is due to the time spent wrapping everything into PyObjects - though I got the same slow result using numpy arrays. There is simply no way for me to take this to a lower level with the Python C API, as the calls are abnormally slow regardless of how many parameters I pass or how complex my data is.

I am at my wits' end. The only options I can think of at the moment are:

  1. Convert this portion into a function dealing only with numpy arrays on the numba JIT.
  2. Store the dataframe column data in something like pickle or SQLite, build up multiple sets of these, and make one final, light-year-long call to C just before terminating the Python routine.

I wanted to post here first to see if there was an easier way out of this pickle (no pun intended).


Solution

  • In general, polars can handle multiple returns - they just need to be packed into a single polars Series (i.e. use a Struct series). From the documentation:

    The output of this custom function must be a Series (or a NumPy array, in which case it will be automatically converted into a Series).

    As a simple example, consider the following.

    import polars as pl
    
    df = pl.DataFrame({
        "A": [1, 2, 3],
        "B": [1, 1, 1],
        "C": [0, 1, 0],
    })
    
    df_new = df.with_columns(
        pl.col("A").map_batches(
            lambda x: pl.Series([{"D": v, "E": 2 * v} for v in x])
        ).alias("result")
    )
    
    shape: (3, 4)
    ┌─────┬─────┬─────┬───────────┐
    │ A   ┆ B   ┆ C   ┆ result    │
    │ --- ┆ --- ┆ --- ┆ ---       │
    │ i64 ┆ i64 ┆ i64 ┆ struct[2] │
    ╞═════╪═════╪═════╪═══════════╡
    │ 1   ┆ 1   ┆ 0   ┆ {1,2}     │
    │ 2   ┆ 1   ┆ 1   ┆ {2,4}     │
    │ 3   ┆ 1   ┆ 0   ┆ {3,6}     │
    └─────┴─────┴─────┴───────────┘
    

    The "result" column can be unnested into the struct fields "D" and "E" using pl.DataFrame.unnest.

    df_new.unnest("result")
    
    shape: (3, 5)
    ┌─────┬─────┬─────┬─────┬─────┐
    │ A   ┆ B   ┆ C   ┆ D   ┆ E   │
    │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
    ╞═════╪═════╪═════╪═════╪═════╡
    │ 1   ┆ 1   ┆ 0   ┆ 1   ┆ 2   │
    │ 2   ┆ 1   ┆ 1   ┆ 2   ┆ 4   │
    │ 3   ┆ 1   ┆ 0   ┆ 3   ┆ 6   │
    └─────┴─────┴─────┴─────┴─────┘
    

A common pattern here is to have a pure numpy function that is numba jit-compatible, plus a simple wrapper that handles the conversion (polars to numpy and numpy back to polars). In your concrete case, this could look as follows.

    import numpy as np
    import numba
    
    @numba.jit(nopython=True)
    def f(a, b, initial_value=1.0):
        a_new = np.empty_like(a)
        b_new = np.empty_like(b)
        d_new = np.empty_like(a)

        carry = initial_value
        for i in range(len(a)):
            a_new[i] = a[i] * carry
            b_new[i] = b[i] * carry
            # D = A_new + B_new, and D becomes the carry for the next row
            d_new[i] = a_new[i] + b_new[i]
            carry = d_new[i]

        return a_new, b_new, d_new

    def unpack(a_new, b_new, d_new):
        df = pl.DataFrame({
            "A_new": a_new,
            "B_new": b_new,
            "D_new": d_new,
        })
        return pl.Series(df.to_dicts())

    (
        df
        .with_columns(
            pl.struct("A", "B").map_batches(
                lambda x: unpack(
                    *f(
                        x.struct.field("A").to_numpy(),
                        x.struct.field("B").to_numpy(),
                    )
                )
            ).alias("result")
        )
    )

    shape: (3, 4)
    ┌─────┬─────┬─────┬───────────┐
    │ A   ┆ B   ┆ C   ┆ result    │
    │ --- ┆ --- ┆ --- ┆ ---       │
    │ i64 ┆ i64 ┆ i64 ┆ struct[3] │
    ╞═════╪═════╪═════╪═══════════╡
    │ 1   ┆ 1   ┆ 0   ┆ {1,1,2}   │
    │ 2   ┆ 1   ┆ 1   ┆ {4,2,6}   │
    │ 3   ┆ 1   ┆ 0   ┆ {18,6,24} │
    └─────┴─────┴─────┴───────────┘
    

    Note. With some more thought, the conversion of the numpy arrays to the polars Struct series (in unpack) can likely be made a bit more elegant.