Search code examples
pythonpandasduplicatesconditional-statementsdelete-row

How do I delete duplicates based on a condition? Python/Pandas


So I have a df that looks something like this:

Person_ID    Procedure_ID   Date(d/m/y)
34           30             03/03/2011
34           30             02/03/2011
32           19             01/01/2020
34           32             01/04/2012

If a Person_ID has the same procedure twice or more, like that 34 - 30 case above, the code needs to keep only the newest row, deleting all the others. In the example I gave the expected result would be:

Person_ID    Procedure_ID   Date(d/m/y)
34           30             02/03/2011
32           19             01/01/2020
34           32             01/04/2012

Thank you in advance!


Solution

  • Groupby 'Person_ID', 'Procedure_ID' and get the last element from each group.

    Code

    df.sort_values(by='Date(d/m/y)').groupby(['Person_ID', 'Procedure_ID'], as_index=False).last()
    

    Output

    Person_ID   Procedure_ID    Date(d/m/y)
    0   32      19              01/01/2020
    1   34      30              02/03/2011
    2   34      32              01/04/2012