Search code examples
pythonpython-3.xpython-polars

filter polars DataFrame based on when rows whose specific columns contain pairs from a list of pairs


In this example, on columns ["foo", "ham"], I want rows 1 and 4 to be removed since they match a pair in the list

df = pl.DataFrame(
    {
        "foo": [1, 1, 2, 2, 3, 3, 4],
        "bar": [6, 7, 8, 9, 10, 11, 12],
        "ham": ["a", "b", "c", "d", "e", "f", "b"]
    }
)
pairs = [(1,"b"),(3,"e"),(4,"g")]

The following worked for me but I think this will be problematic when the dataframe and list of pairs are large.

for a, b in pairs:
    df = df.filter(~(pl.col('foo') == a) | ~(pl.col('ham') == b))

I think this is the pandas implementation for this problem Pandas: How to remove rows from a dataframe based on a list of tuples representing values in TWO columns?

I am not sure what the Polars implementation of it is.

(I think this problem can be generalized to any number of selected columns and any number of elements in a group. For instance, rather than a list of pairs, it can be another dataframe. You get the 'set difference', in terms of rows, of the two dataframes based on specific columns.)


Solution

  • It looks like an ANTI JOIN

    schema = ["foo", "ham"]
    
    (df.with_row_index() # just to show what "row numbers" were "removed"
       .join(
           pl.DataFrame(pairs, orient="row", schema=schema), 
           on = schema, 
           how = "anti"
       )
    )
    
    shape: (5, 4)
    ┌───────┬─────┬─────┬─────┐
    │ index ┆ foo ┆ bar ┆ ham │
    │ ---   ┆ --- ┆ --- ┆ --- │
    │ u32   ┆ i64 ┆ i64 ┆ str │
    ╞═══════╪═════╪═════╪═════╡
    │ 0     ┆ 1   ┆ 6   ┆ a   │
    │ 2     ┆ 2   ┆ 8   ┆ c   │
    │ 3     ┆ 2   ┆ 9   ┆ d   │
    │ 5     ┆ 3   ┆ 11  ┆ f   │
    │ 6     ┆ 4   ┆ 12  ┆ b   │
    └───────┴─────┴─────┴─────┘