Let's say I have an auth
field that use bit flags to indicate permissions (example bit-0 means add
and bit-1 means delete
).
How do I bitwise-OR
them together?
import polars as pl
df_in = pl.DataFrame(
{
"k": ["a", "a", "b", "b", "c"],
"auth": [1, 3, 1, 0, 0],
}
)
The dataframe:
df_in: shape: (5, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 1 │
│ a ┆ 3 │
│ b ┆ 1 │
│ b ┆ 0 │
│ c ┆ 0 │
└─────┴──────┘
When I group by and sum, things look good, I sum the auth
by k
dfsum = df_in.group_by("k").agg(pl.col("auth").sum())
dfsum: shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 4 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
So, it looks as if I am using group_by
and agg
correctly, when using sum
.
Not so good when using or_
.
dfor = df_in.group_by("k").agg(pl.col("auth").or_())
gives
dfor: shape: (3, 2)
┌─────┬───────────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═════╪═══════════╡
│ a ┆ [1, 3] │
│ b ┆ [1, 0] │
│ c ┆ [0] │
└─────┴───────────┘
for the or_
I was expecting this result instead:
df_wanted_or: shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 3 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
Now, I did find a workaround by using map_batches
to call a Python function. Very simple something like
functools.reduce(lambda x,y: x|y)
but how do I do this without leaving Polars?
Update.
Bitwise aggregation was implemented in version 1.9.0
. So now you can use pl.Expr.bitwise_or()
:
(
df_in
.group_by("k", maintain_order=True)
.agg(pl.col.auth.bitwise_or())
)
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 3 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
Previous answer. Bitwise aggregation is not yet implemented in polars - issue. There're a few ways you could approach it though:
unique()
- not strictly necessary, but can reduce size of the aggregated lists.list.to_struct()
to convert aggregated data to Struct..reduce()
to apply bitwise or
operator.field()
to access all the fields of the Struct within reduce
context.(
df_in
.group_by("k")
.agg(pl.col.auth.unique())
.with_columns(pl.col.auth.list.to_struct())
.with_columns(
auth = pl.reduce(
lambda acc, x: acc | x,
exprs = pl.col.auth.struct.field("*")
).fill_null(0)
)
)
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ b ┆ 1 │
│ a ┆ 3 │
│ c ┆ 0 │
└─────┴──────┘
You can use DuckDB integration with Polars and bit_or()
;
duckdb.sql("""
select
k,
bit_or(auth) as auth
from df_in
group by
k
""").pl()
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ a ┆ 3 │
│ b ┆ 1 │
│ c ┆ 0 │
└─────┴──────┘
Another possible way to do that would be to use Polars integration with NumPy.
First, use pure polars to aggregate auth
columns to lists and convert them to arrays.
df_agg = df_in.group_by("k").agg("auth")
w = df_agg["auth"].list.len().max()
df_agg = (
df_agg
.with_columns(
pl.col.auth.list.concat(
pl.lit(0).repeat_by(w - pl.col.auth.list.len())
)
).with_columns(pl.col.auth.list.to_array(w))
)
shape: (3, 2)
┌─────┬───────────────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ array[i64, 2] │
╞═════╪═══════════════╡
│ b ┆ [1, 0] │
│ a ┆ [1, 3] │
│ c ┆ [0, 0] │
└─────┴───────────────┘
Now we can get auth
column as Series, convert it to 2d numpy array with to_numpy()
and use np.bitwise_or
and reduce()
:
(
df_agg
.with_columns(
auth = np.bitwise_or.reduce(df_agg["auth"].to_numpy(), axis=1)
)
)
shape: (3, 2)
┌─────┬──────┐
│ k ┆ auth │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ b ┆ 1 │
│ a ┆ 3 │
│ c ┆ 0 │
└─────┴──────┘