I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
data = [
["x",123, 4.5, "misc"],
["y",456,10.0,"other"],
["z",789,99.5,"value"],
],
schema = ["a","b","c","d"],
orient = "row",
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
pl.DataFrame(
data = [('x',123),('y',456)],
schema = {col:tp for col,tp in df.schema.items() if col in ("a","b")},
orient = "row",
),
on = ["a","b"],
how = "inner", # or 'anti' for "not in"
)
Is a multi-column is_in
construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release >0.13.44
this will work on the struct
datatype.
We convert the 2 (or more) columns we want to check to a struct with pl.struct
and call the is_in
expression. (A conversion to struct
is a free operation)
df = pl.DataFrame(
data=[
["x", 123, 4.5, "misc"],
["y", 456, 10.0, "other"],
["z", 789, 99.5, "value"],
],
schema=["a", "b", "c", "d"],
orient="row",
)
df.filter(
pl.struct("a", "b").is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
The idiomatic way to filter data by presence in another DataFrame
are semi and anti joins. Inner joins also filter by presence, but they include the columns of the right hand DataFrame, where a semi join does not and only filters the left hand side.
The reason why these joins are preferred over is_in
is that they are much faster and currently allow for more optimization.