Search code examples
pythonpandasconditional-statementsduplicatesrow-removal

Is there a way to remove duplicates in pandas dataframe based on relationship between the duplicates?


I have a large dataset where I need to remove some duplicates from a pandas dataframe, but not all. In the example data below, each product record has the product name, year of the record, and a reference number. In most cases, a product should only have one reference number (the most recent) but if one product has multiple reference numbers that are equally recent I need to keep both.

So the code I'm looking to implement would do the following:

  • In the case of "product1" it would remove the older record from 2022 (row index 1) and keep row index 0.
  • In the case of "product2" however, because both of the most recent references are from 2023 it would recognise that they both need to be kept, so the only duplicate that needs removing is row index 4.
  • In the case of "product3", all three records are from 2022 so it would recognise that rows 5 & 6 both need to be kept, and row 7 needs deleting as it is an exact duplicate of row 5.
  Product  Year Reference
0  product1  2023      ref1
1  product1  2022      ref2
2  product2  2023      ref1
3  product2  2023      ref2
4  product2  2022      ref1
5  product3  2022      ref1
6  product3  2022      ref2
7  product3  2022      ref1 |

Hope I've explained that well enough!

At a bit of a loss on this one, I have searched around and unable to find a similar issue on here...


Solution

  • You can use groupby and a mask to keep only the latest year within each sub frame:

    df = df.groupby('Product', sort=False).apply(lambda product_df : product_df[product_df.Year == product_df.Year.max()].drop_duplicates()).reset_index(drop=True)