python pandas dataframe duplicates drop-duplicates

How to drop duplicate values in one column for each id in a Pandas DataFrame?


I have a DataFrame in Pandas like the one below:

import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 3, 3],
        'nps': [8, 8, 8, 8, 7, 7, 9, 9],
        'target': [True, True, True, True, False, False, True, True],
        'score': [0.56, 0.78, 0.56, 0.78, 0.6785, 0.42, 0.9, 0.63],
        'day': ['2023-02-15', '2023-02-15', '2023-02-22', '2023-02-22', '2023-06-10', '2023-06-10', '2023-07-01', '2023-07-01']}
df = pd.DataFrame(data)


As you can see, the score column contains duplicate values for each id. I need only one score per id.

As a result, I need something like the example below:

id | nps | target  | score  | day
---|-----|---------|--------|-----------
1  | 8   | True    | 0.56   | 2023-02-15
1  | 8   | True    | 0.56   | 2023-02-22
2  | 7   | False   | 0.42   | 2023-06-10
3  | 9   | True    | 0.90   | 2023-07-01

How can I do that in Pandas?


Solution

  • Do you mean one score per id, per day? In your example, id 1 repeats, but on separate days.

    If that's the case, you can do something like this:

    df.drop_duplicates(subset=['id', 'day'], keep='first', inplace=True)
    

    If you need to drop all duplicates, regardless of their date, then just remove 'day' from the subset.

    df.drop_duplicates(subset=['id'], keep='first', inplace=True)
    

    Each snippet keeps the first row for every unique combination of the subset columns and drops the rest. Because inplace=True is set, df is modified directly and the call returns None.
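To see what the two calls above return on the sample data from the question, here is a minimal sketch. The names per_id_day and per_id are only illustrative, and it uses the non-inplace form (plus reset_index) so that both results can be compared against the same original df.

```python
import pandas as pd

# Sample data copied from the question.
data = {'id': [1, 1, 1, 1, 2, 2, 3, 3],
        'nps': [8, 8, 8, 8, 7, 7, 9, 9],
        'target': [True, True, True, True, False, False, True, True],
        'score': [0.56, 0.78, 0.56, 0.78, 0.6785, 0.42, 0.9, 0.63],
        'day': ['2023-02-15', '2023-02-15', '2023-02-22', '2023-02-22',
                '2023-06-10', '2023-06-10', '2023-07-01', '2023-07-01']}
df = pd.DataFrame(data)

# One row per (id, day) pair: the first row seen for each pair is kept.
per_id_day = df.drop_duplicates(subset=['id', 'day'], keep='first').reset_index(drop=True)

# One row per id: the first row seen for each id is kept.
per_id = df.drop_duplicates(subset=['id'], keep='first').reset_index(drop=True)

print(per_id_day)
print(per_id)
```

Note that keep='first' depends on row order, so for id 2 this keeps the score 0.6785 rather than the 0.42 shown in the question's example table. If a particular score should survive (for instance the lowest), one option is to sort with sort_values on score before calling drop_duplicates.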