Given the following table, I'd like to remove the duplicates based on the column subset col1, col2, keeping the first row of each group of duplicates:
import polars as pl

data = {
    'col1': [1, 2, 3, 1, 1],
    'col2': [7, 8, 9, 7, 7],
    'col3': [3, 4, 5, 6, 8]
}
tmp = pl.DataFrame(data)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ 1 ┆ 7 ┆ 3 │
│ 2 ┆ 8 ┆ 4 │
│ 3 ┆ 9 ┆ 5 │
│ 1 ┆ 7 ┆ 6 │
│ 1 ┆ 7 ┆ 8 │
└──────┴──────┴──────┘
The result should be
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╡
│ 1 ┆ 7 ┆ 3 │
│ 2 ┆ 8 ┆ 4 │
│ 3 ┆ 9 ┆ 5 │
└──────┴──────┴──────┘
Usually I'd do this in pandas with df.duplicated(subset=["col1", "col2"], keep='first'), but Polars df.is_duplicated()
takes no keep argument and marks all rows as duplicates, including the first occurrence.
You can use DataFrame.unique; it accepts the flexible keep keyword argument:
tmp.unique(subset=['col1', 'col2'], keep='first', maintain_order=True)