
Polars - drop duplicate rows based on a column subset but keep the first


Given the following table, I'd like to remove the duplicates based on the column subset col1,col2. I'd like to keep the first row of the duplicates though:

import polars as pl

data = {
    'col1': [1, 2, 3, 1, 1],
    'col2': [7, 8, 9, 7, 7],
    'col3': [3, 4, 5, 6, 8]
}
tmp = pl.DataFrame(data)
print(tmp)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ 7    ┆ 3    │
│ 2    ┆ 8    ┆ 4    │
│ 3    ┆ 9    ┆ 5    │
│ 1    ┆ 7    ┆ 6    │
│ 1    ┆ 7    ┆ 8    │
└──────┴──────┴──────┘

The result should be

┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ 1    ┆ 7    ┆ 3    │
│ 2    ┆ 8    ┆ 4    │
│ 3    ┆ 9    ┆ 5    │
└──────┴──────┴──────┘

Usually I'd do this in pandas with df.duplicated(subset=["col1", "col2"], keep='first'), but Polars' DataFrame.is_duplicated() has no keep argument and marks all rows that have a duplicate, including the first occurrence.
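
For context, the pandas version I have in mind is roughly this (just a sketch, with df built from the same data dict as above):

import pandas as pd

df = pd.DataFrame(data)
# duplicated() marks every repeat after the first occurrence,
# so negating it keeps the first row of each (col1, col2) pair
df[~df.duplicated(subset=['col1', 'col2'], keep='first')]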


Solution

  • You can use DataFrame.unique with a column subset; its keep keyword argument ('first', 'last', 'any', 'none') gives you the flexibility that is_duplicated lacks, and maintain_order=True preserves the original row order:

    tmp.unique(subset=('col1', 'col2'), keep='first', maintain_order=True)
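
  • As an alternative sketch, you can filter on the first occurrence of each (col1, col2) pair with an expression (on Polars versions before the rename, the method is called is_first instead of is_first_distinct):

    # keep only the first row for each (col1, col2) combination
    tmp.filter(pl.struct(['col1', 'col2']).is_first_distinct())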