In this example, on columns ["foo", "ham"]
, I want rows 1 and 4 to be removed since they match a pair in the list
df = pl.DataFrame(
{
"foo": [1, 1, 2, 2, 3, 3, 4],
"bar": [6, 7, 8, 9, 10, 11, 12],
"ham": ["a", "b", "c", "d", "e", "f", "b"]
}
)
pairs = [(1,"b"),(3,"e"),(4,"g")]
The following worked for me but I think this will be problematic when the dataframe and list of pairs are large.
for a, b in pairs:
df = df.filter(~(pl.col('foo') == a) | ~(pl.col('ham') == b))
I think this is the pandas implementation for this problem Pandas: How to remove rows from a dataframe based on a list of tuples representing values in TWO columns?
I am not sure what the Polars implementation of it is.
(I think this problem can be generalized to any number of selected columns and any number of elements in a group. For instance, rather than a list of pairs, it can be another dataframe. You get the 'set difference', in terms of rows, of the two dataframes based on specific columns.)
It looks like an ANTI JOIN
schema = ["foo", "ham"]
(df.with_row_index() # just to show what "row numbers" were "removed"
.join(
pl.DataFrame(pairs, orient="row", schema=schema),
on = schema,
how = "anti"
)
)
shape: (5, 4)
┌───────┬─────┬─────┬─────┐
│ index ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ str │
╞═══════╪═════╪═════╪═════╡
│ 0 ┆ 1 ┆ 6 ┆ a │
│ 2 ┆ 2 ┆ 8 ┆ c │
│ 3 ┆ 2 ┆ 9 ┆ d │
│ 5 ┆ 3 ┆ 11 ┆ f │
│ 6 ┆ 4 ┆ 12 ┆ b │
└───────┴─────┴─────┴─────┘