I have a pandas DataFrame like the one below:
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 3, 3],
        'nps': [8, 8, 8, 8, 7, 7, 9, 9],
        'target': [True, True, True, True, False, False, True, True],
        'score': [0.56, 0.78, 0.56, 0.78, 0.6785, 0.42, 0.9, 0.63],
        'day': ['2023-02-15', '2023-02-15', '2023-02-22', '2023-02-22', '2023-06-10', '2023-06-10', '2023-07-01', '2023-07-01']}
df = pd.DataFrame(data)
As you can see, each id has several rows with different values in the score column. I need only one score per id.
As a result, I need something like the example below:
id | nps | target | score | day
---|-----|---------|--------|-----------
1 | 8 | True | 0.56 | 2023-02-15
1 | 8 | True | 0.56 | 2023-02-22
2 | 7 | False | 0.42 | 2023-06-10
3 | 9 | True | 0.90 | 2023-07-01
How can I do that in pandas?
Do you mean one score per id, per day? In your example, id 1 appears twice, but on separate days.
If that's the case, you can do something like this:
df.drop_duplicates(subset=['id', 'day'], keep='first', inplace=True)
If you need to drop all duplicates of an id, regardless of the day, remove 'day' from subset:
df.drop_duplicates(subset=['id'], keep='first', inplace=True)
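With keep='first', these snippets keep the first row for each id/day combination (or for each id, in the second snippet) and drop the rest. Because inplace=True is set, df itself is modified and nothing is returned.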
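For reference, here is a minimal, self-contained sketch of both variants on the sample data from the question. It returns new DataFrames instead of using inplace=True. The names per_id_day, per_id and lowest_per_id_day are only illustrative. The last step, sorting by score first so that the lowest score is the one kept, is an assumption about what you might want (your example output keeps 0.42 for id 2, which is not the first row), not something the question states.

```python
import pandas as pd

data = {
    'id': [1, 1, 1, 1, 2, 2, 3, 3],
    'nps': [8, 8, 8, 8, 7, 7, 9, 9],
    'target': [True, True, True, True, False, False, True, True],
    'score': [0.56, 0.78, 0.56, 0.78, 0.6785, 0.42, 0.9, 0.63],
    'day': ['2023-02-15', '2023-02-15', '2023-02-22', '2023-02-22',
            '2023-06-10', '2023-06-10', '2023-07-01', '2023-07-01'],
}
df = pd.DataFrame(data)

# One row per (id, day): keeps the first row seen for each pair.
per_id_day = df.drop_duplicates(subset=['id', 'day'], keep='first')

# One row per id, regardless of day.
per_id = df.drop_duplicates(subset=['id'], keep='first')

# Assumption: if the lowest score per (id, day) should survive,
# sort by score first so that row becomes the "first" occurrence,
# then restore the original row order.
lowest_per_id_day = (
    df.sort_values('score')
      .drop_duplicates(subset=['id', 'day'], keep='first')
      .sort_index()
)

print(per_id_day)
print(per_id)
print(lowest_per_id_day)
```

Which row counts as "first" depends entirely on the current row order, so sort explicitly whenever the choice of score matters.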