I have a large dataset where I need to remove some duplicates from a pandas dataframe, but not all. In the example data below, each product record has the product name, year of the record, and a reference number. In most cases, a product should only have one reference number (the most recent) but if one product has multiple reference numbers that are equally recent I need to keep both.
So the code I'm looking to implement would do the following:
Product Year Reference
0 product1 2023 ref1
1 product1 2022 ref2
2 product2 2023 ref1
3 product2 2023 ref2
4 product2 2022 ref1
5 product3 2022 ref1
6 product3 2022 ref2
7 product3 2022 ref1 |
Hope I've explained that well enough!
At a bit of a loss on this one, I have searched around and unable to find a similar issue on here...
You can use groupby and a mask to keep only the latest year within each sub frame:
df = df.groupby('Product', sort=False).apply(lambda product_df : product_df[product_df.Year == product_df.Year.max()].drop_duplicates()).reset_index(drop=True)