Search code examples
pythonpandasduplicatesdrop

df.drop_duplicates python


Running into some difficulty trying to drop correct duplicates from a dataframe.

I have the following example:

import numpy as np
import pandas as pd


test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', 
        '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value' : [123, '', 324, '', '', '', 321],}

df = pd.DataFrame(data=test)

The output can be seen below:

                  date value
0  2012-10-12 10:10:10   123
1  2012-10-12 10:10:10      
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
4  2012-11-02 16:08:07      
5  2012-12-12 23:45:21      
6  2012-12-12 23:45:21   321

My desired output after dropping duplicate dates is as shown below:

                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
6  2012-12-12 23:45:21   321 

However, my attempts to date have been unsuccessful as shown below:

Attempt 1:-

df = df.drop_duplicates(subset='date')

                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
5  2012-12-12 23:45:21      

Attempt 2:-

df = df.drop_duplicates(subset='date', keep='last')

                  date value
1  2012-10-12 10:10:10      
2  2012-10-19 10:55:10   324
4  2012-11-02 16:08:07      
6  2012-12-12 23:45:21   321

Please can you assist with helping me reach my desired output. Many thanks in advance


Solution

  • One approach is to mask the empty strings in the column value, then groupby on date and aggregate using first:

    df['value'].mask(df['value'].eq('')).groupby(df['date']).first().fillna('').reset_index()
    

    Alternatively you can mask the empty strings in the column value and assign it to temporary column key, then sort the dataframe on columns date and key, followed by drop_duplicates:

    df['key'] = df['value'].mask(df['value'].eq(''))
    df.sort_values(['date', 'key']).drop_duplicates('date').drop('key', 1)
    

    Result:

                      date value
    0  2012-10-12 10:10:10   123
    1  2012-10-19 10:55:10   324
    2  2012-11-02 16:08:07      
    3  2012-12-12 23:45:21   321