Search code examples
pythonpython-polars

Conditional deduplication in polars


I have a dataset i'm trying to remove duplicate entries from. The lazyframe i'm working with is structured like this:

df = pl.from_repr("""
┌──────┬────────────┬──────────────────┬───────┐
│ id   ┆ title      ┆ type             ┆ type2 │
│ ---  ┆ ---        ┆ ---              ┆ ---   │
│ i64  ┆ str        ┆ str              ┆ i64   │
╞══════╪════════════╪══════════════════╪═══════╡
│ 1001 ┆ Research A ┆ journal article  ┆ 35    │
│ 1002 ┆ Research B ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research C ┆ journal article  ┆ 35    │
│ 1004 ┆ Research D ┆ conference paper ┆ 42    │
│ 1001 ┆ Research E ┆ journal article  ┆ 35    │
│ 1002 ┆ Research F ┆ journal article  ┆ 41    │
│ 1003 ┆ Research G ┆                  ┆ 41    │
│ 1002 ┆ Research I ┆ book chapter     ┆ 41    │
│ 1003 ┆ Research J ┆ journal article  ┆ 35    │
└──────┴────────────┴──────────────────┴───────┘
""")

I want to remove entries that have the same id, but there are actually different cases:

  1. The duplicates have the same type (e.g. 1001): keep the first one.
  2. The duplicates have a different type: discard the ones with an empty string ("") as type, and then keep only the entries that respect the following pairs of type and type2:
dict_df = pl.DataFrame({
    "type": ['journal article', 'book chapter', 'book chapter'],
    "type2": [35, 41, 42]
})

Expected output

id[i64] title[str] type[str] type2[i64]
1001 Research A journal article 35
1002 Research B book chapter 41
1003 Research C journal article 35
1004 Research D conference paper 42
  • 1001: same type, keep the first one
  • 1002: different type, keep the first occurrence of the entries with pair {'book chapter': 41}
  • 1003: different type, discard the entries with an empty type and keep the first occurrence
  • 1004: not a duplicate

I've tried many things, mainly using the pl.when() expression, but I cannot work out a way to filter the groups.

(
    df
    .sort('type', descending=True)
    .group_by("id")
    .agg([
        pl.when(pl.col("type").n_unique() > 1)
        .then( ... )
        .otherwise(pl.all().first())
    ])
)


Solution

    1. Create a preference table indicating how much you want each combination:
    preference = pl.DataFrame({
        "type": ["journal article", "book chapter", "book chapter"],
        "iris_type": [35, 41, 42],
        "preference": [0, 1, 2]
    })
    
    1. Join the preference table with your data table:
    joined = df.lazy().join(preference.lazy(), on=["type", "iris_type"], how="left")
    
    1. Sort the joined table on the preference, pick the first one in each group and drop the preference column:
    out = (
        joined.sort("preference", descending=True, nulls_last=True)
            .group_by("id")
            .first()
            .drop("preference")
            .collect()
    )