Running into some difficulty trying to drop correct duplicates from a dataframe.
I have the following example:
import numpy as np
import pandas as pd
test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
'2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
'value' : [123, '', 324, '', '', '', 321],}
df = pd.DataFrame(data=test)
The output can be seen below:
date value
0 2012-10-12 10:10:10 123
1 2012-10-12 10:10:10
2 2012-10-19 10:55:10 324
3 2012-11-02 16:08:07
4 2012-11-02 16:08:07
5 2012-12-12 23:45:21
6 2012-12-12 23:45:21 321
My desired output after dropping duplicate dates is as shown below:
date value
0 2012-10-12 10:10:10 123
2 2012-10-19 10:55:10 324
3 2012-11-02 16:08:07
6 2012-12-12 23:45:21 321
However, my attempts to date have been unsuccessful as shown below:
Attempt 1:-
df = df.drop_duplicates(subset='date')
date value
0 2012-10-12 10:10:10 123
2 2012-10-19 10:55:10 324
3 2012-11-02 16:08:07
5 2012-12-12 23:45:21
Attempt 2:-
df = df.drop_duplicates(subset='date', keep='last')
date value
1 2012-10-12 10:10:10
2 2012-10-19 10:55:10 324
4 2012-11-02 16:08:07
6 2012-12-12 23:45:21 321
Please can you assist with helping me reach my desired output. Many thanks in advance
One approach is to mask the empty strings in the column value
, then groupby on date
and aggregate using first
:
df['value'].mask(df['value'].eq('')).groupby(df['date']).first().fillna('').reset_index()
Alternatively you can mask the empty strings in the column value
and assign it to temporary column key
, then sort the dataframe on columns date
and key
, followed by drop_duplicates
:
df['key'] = df['value'].mask(df['value'].eq(''))
df.sort_values(['date', 'key']).drop_duplicates('date').drop('key', 1)
Result:
date value
0 2012-10-12 10:10:10 123
1 2012-10-19 10:55:10 324
2 2012-11-02 16:08:07
3 2012-12-12 23:45:21 321