Search code examples
python-polars

Lightweight syntax for filtering a polars DataFrame on a multi-column key?


I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).

Using the following frame as an example:

import polars as pl
df = pl.DataFrame( 
    data = [
        ["x",123, 4.5, "misc"],
        ["y",456,10.0,"other"],
        ["z",789,99.5,"value"],
    ],
    schema = ["a","b","c","d"],
    orient = "row",
)

A PostgreSQL statement could use a VALUES expression, like so...

(("a","b") IN (VALUES ('x',123),('y',456)))

...and a pandas equivalent might set a multi-column index.

pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]

The polars syntax would look like this:

df.join( 
    pl.DataFrame(
        data = [('x',123),('y',456)],
        schema = {col:tp for col,tp in df.schema.items() if col in ("a","b")},
        orient = "row",
    ),
    on = ["a","b"],
    how = "inner", # or 'anti' for "not in"
)

Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):

df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )

Solution

  • In the next polars release >0.13.44 this will work on the struct datatype.

    We convert the 2 (or more) columns we want to check to a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation)

    df = pl.DataFrame(
        data=[
            ["x", 123, 4.5, "misc"],
            ["y", 456, 10.0, "other"],
            ["z", 789, 99.5, "value"],
        ],
        schema=["a", "b", "c", "d"],
        orient="row",
    )
    
    df.filter(
        pl.struct("a", "b").is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
    )
    
    shape: (2, 4)
    ┌─────┬─────┬──────┬───────┐
    │ a   ┆ b   ┆ c    ┆ d     │
    │ --- ┆ --- ┆ ---  ┆ ---   │
    │ str ┆ i64 ┆ f64  ┆ str   │
    ╞═════╪═════╪══════╪═══════╡
    │ x   ┆ 123 ┆ 4.5  ┆ misc  │
    │ y   ┆ 456 ┆ 10.0 ┆ other │
    └─────┴─────┴──────┴───────┘
    
    

    Filtering by data in another DataFrame.

    The idiomatic way to filter data by presence in another DataFrame are semi and anti joins. Inner joins also filter by presence, but they include the columns of the right hand DataFrame, where a semi join does not and only filters the left hand side.

    • semi: keep rows/keys that are in both DataFrames
    • anti: remove rows/keys that are in both DataFrames

    The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.