Search code examples
pythonaggregatebitwise-operatorspython-polars

Can I perform a bit-wise group by and aggregation with Polars `or_`?


Let's say I have an auth field that use bit flags to indicate permissions (example bit-0 means add and bit-1 means delete).

How do I bitwise-OR them together?

import polars as pl

df_in = pl.DataFrame(
    {
        "k": ["a", "a", "b", "b", "c"],
        "auth": [1, 3, 1, 0, 0],
    }
)

The dataframe:

df_in: shape: (5, 2)
┌─────┬──────┐
│ k   ┆ auth │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ a   ┆ 1    │
│ a   ┆ 3    │
│ b   ┆ 1    │
│ b   ┆ 0    │
│ c   ┆ 0    │
└─────┴──────┘

When I group by and sum, things look good, I sum the auth by k

dfsum = df_in.group_by("k").agg(pl.col("auth").sum())
dfsum: shape: (3, 2)
┌─────┬──────┐
│ k   ┆ auth │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ a   ┆ 4    │
│ b   ┆ 1    │
│ c   ┆ 0    │
└─────┴──────┘

So, it looks as if I am using group_by and agg correctly, when using sum.

Not so good when using or_.

dfor = df_in.group_by("k").agg(pl.col("auth").or_())

gives

dfor: shape: (3, 2)
┌─────┬───────────┐
│ k   ┆ auth      │
│ --- ┆ ---       │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a   ┆ [1, 3]    │
│ b   ┆ [1, 0]    │
│ c   ┆ [0]       │
└─────┴───────────┘

Expectations:

for the or_ I was expecting this result instead:

df_wanted_or: shape: (3, 2)
┌─────┬──────┐
│ k   ┆ auth │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ a   ┆ 3    │
│ b   ┆ 1    │
│ c   ┆ 0    │
└─────┴──────┘

Now, I did find a workaround by using map_batches to call a Python function. Very simple something like

functools.reduce(lambda x,y: x|y)

but how do I do this without leaving Polars?


Solution

  • Update. Bitwise aggregation was implemented in version 1.9.0. So now you can use pl.Expr.bitwise_or():

    (
        df_in
        .group_by("k", maintain_order=True)
        .agg(pl.col.auth.bitwise_or())
    )
    
    shape: (3, 2)
    ┌─────┬──────┐
    │ k   ┆ auth │
    │ --- ┆ ---  │
    │ str ┆ i64  │
    ╞═════╪══════╡
    │ a   ┆ 3    │
    │ b   ┆ 1    │
    │ c   ┆ 0    │
    └─────┴──────┘
    

    Previous answer. Bitwise aggregation is not yet implemented in polars - issue. There're a few ways you could approach it though:

    1. Pure polars solution.

    • unique() - not strictly necessary, but can reduce size of the aggregated lists.
    • list.to_struct() to convert aggregated data to Struct.
    • .reduce() to apply bitwise or operator.
    • field() to access all the fields of the Struct within reduce context.
    (
        df_in
        .group_by("k")
        .agg(pl.col.auth.unique())
        .with_columns(pl.col.auth.list.to_struct())
        .with_columns(
            auth = pl.reduce(
                lambda acc, x: acc | x,
                exprs = pl.col.auth.struct.field("*")
            ).fill_null(0)
        )
    )
    
    shape: (3, 2)
    ┌─────┬──────┐
    │ k   ┆ auth │
    │ --- ┆ ---  │
    │ str ┆ i64  │
    ╞═════╪══════╡
    │ b   ┆ 1    │
    │ a   ┆ 3    │
    │ c   ┆ 0    │
    └─────┴──────┘
    

    2. DuckDB integration with Polars.

    You can use DuckDB integration with Polars and bit_or();

    duckdb.sql("""
        select
            k,
            bit_or(auth) as auth
        from df_in
        group by
            k
    """).pl()
    
    shape: (3, 2)
    ┌─────┬──────┐
    │ k   ┆ auth │
    │ --- ┆ ---  │
    │ str ┆ i64  │
    ╞═════╪══════╡
    │ a   ┆ 3    │
    │ b   ┆ 1    │
    │ c   ┆ 0    │
    └─────┴──────┘
    

    3. Polars integration with NumPy

    Another possible way to do that would be to use Polars integration with NumPy. First, use pure polars to aggregate auth columns to lists and convert them to arrays.

    df_agg = df_in.group_by("k").agg("auth")
    
    w = df_agg["auth"].list.len().max()
    
    df_agg = (
        df_agg
        .with_columns(
            pl.col.auth.list.concat(
                pl.lit(0).repeat_by(w - pl.col.auth.list.len())
           )
        ).with_columns(pl.col.auth.list.to_array(w))
    )
    
    shape: (3, 2)
    ┌─────┬───────────────┐
    │ k   ┆ auth          │
    │ --- ┆ ---           │
    │ str ┆ array[i64, 2] │
    ╞═════╪═══════════════╡
    │ b   ┆ [1, 0]        │
    │ a   ┆ [1, 3]        │
    │ c   ┆ [0, 0]        │
    └─────┴───────────────┘
    

    Now we can get auth column as Series, convert it to 2d numpy array with to_numpy() and use np.bitwise_or and reduce():

    (
        df_agg
        .with_columns(
            auth = np.bitwise_or.reduce(df_agg["auth"].to_numpy(), axis=1)
        )
    )
    
    shape: (3, 2)
    ┌─────┬──────┐
    │ k   ┆ auth │
    │ --- ┆ ---  │
    │ str ┆ i64  │
    ╞═════╪══════╡
    │ b   ┆ 1    │
    │ a   ┆ 3    │
    │ c   ┆ 0    │
    └─────┴──────┘