I have a dataset i'm trying to remove duplicate entries from. The lazyframe i'm working with is structured like this:
df = pl.from_repr("""
┌──────┬────────────┬──────────────────┬───────┐
│ id ┆ title ┆ type ┆ type2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞══════╪════════════╪══════════════════╪═══════╡
│ 1001 ┆ Research A ┆ journal article ┆ 35 │
│ 1002 ┆ Research B ┆ book chapter ┆ 41 │
│ 1003 ┆ Research C ┆ journal article ┆ 35 │
│ 1004 ┆ Research D ┆ conference paper ┆ 42 │
│ 1001 ┆ Research E ┆ journal article ┆ 35 │
│ 1002 ┆ Research F ┆ journal article ┆ 41 │
│ 1003 ┆ Research G ┆ ┆ 41 │
│ 1002 ┆ Research I ┆ book chapter ┆ 41 │
│ 1003 ┆ Research J ┆ journal article ┆ 35 │
└──────┴────────────┴──────────────────┴───────┘
""")
I want to remove entries that have the same id
, but there are actually different cases:
type
(e.g. 1001): keep the first one.type
: discard the ones with an empty string ("") as type, and then keep only the entries that respect the following pairs of type
and type2
:dict_df = pl.DataFrame({
"type": ['journal article', 'book chapter', 'book chapter'],
"type2": [35, 41, 42]
})
Expected output
id[i64] | title[str] | type[str] | type2[i64] |
---|---|---|---|
1001 | Research A | journal article | 35 |
1002 | Research B | book chapter | 41 |
1003 | Research C | journal article | 35 |
1004 | Research D | conference paper | 42 |
I've tried many things, mainly using the pl.when() expression, but I cannot work out a way to filter the groups.
(
df
.sort('type', descending=True)
.group_by("id")
.agg([
pl.when(pl.col("type").n_unique() > 1)
.then( ... )
.otherwise(pl.all().first())
])
)
preference = pl.DataFrame({
"type": ["journal article", "book chapter", "book chapter"],
"iris_type": [35, 41, 42],
"preference": [0, 1, 2]
})
joined = df.lazy().join(preference.lazy(), on=["type", "iris_type"], how="left")
out = (
joined.sort("preference", descending=True, nulls_last=True)
.group_by("id")
.first()
.drop("preference")
.collect()
)