
Python/Pandas - Delete duplicate rows by column value


I have DataFrame like this:

     sale_id          dt        receipts_qty 
31     196.0  2017-02-19                95.0    
32     203.0  2017-02-20               101.0   
33     196.0  2017-02-21               105.0            
34     196.0  2017-02-22               112.0           
35     196.0  2017-02-23               118.0           
36     196.0  2017-02-24               135.0            
37     196.0  2017-02-25               135.0           
38     196.0  2017-02-26               124.0           
40     203.0  2017-02-27               290.0          
39     196.0  2017-02-27                84.0          
42     203.0  2017-02-28               330.0            
41     196.0  2017-02-28               124.0           
43     196.0  2017-03-01               100.0          
44     203.0  2017-03-01               361.0         

I need to drop duplicates by dt, keeping the row with sale_id == 196 whenever one exists for that date. I have only found drop_duplicates('dt', keep='last') and drop_duplicates('dt', keep='first'), but neither does what I need.

DataFrame I want to get:

     sale_id          dt        receipts_qty  
31     196.0  2017-02-19                95.0   
32     203.0  2017-02-20               101.0       
33     196.0  2017-02-21               105.0            
34     196.0  2017-02-22               112.0           
35     196.0  2017-02-23               118.0           
36     196.0  2017-02-24               135.0            
37     196.0  2017-02-25               135.0           
38     196.0  2017-02-26               124.0                 
39     196.0  2017-02-27                84.0                     
41     196.0  2017-02-28               124.0           
43     196.0  2017-03-01               100.0          

Solution

  • First create a helper column a that flags the rows matching the condition, then sort_values so the flagged rows come first and drop_duplicates keeps them.

    As a last cleanup, remove column a and restore the original order with sort_index:

    print (df)
        sale_id          dt  receipts_qty
    31    196.0  2017-02-19          95.0
    32    203.0  2017-02-20         101.0
    33    196.0  2017-02-21         105.0
    34    196.0  2017-02-22         112.0
    35    196.0  2017-02-23         118.0
    36    196.0  2017-02-24         135.0
    37    196.0  2017-02-25         135.0
    38    196.0  2017-02-26         124.0
    40    203.0  2017-02-27         290.0
    39    196.0  2017-02-27          84.0
    42    103.0  2017-02-28         330.0 <- value changed for this demo; the mask tests equality, not ordering
    41    196.0  2017-02-28         124.0
    43    196.0  2017-03-01         100.0
    44    203.0  2017-03-01         361.0
    

    #flag rows where sale_id == 196 (True -> 1, False -> 0)
    df['a'] = (df.sale_id == 196).astype(int)
    #sort flagged rows first, drop duplicates per dt, remove helper column
    df = (df.sort_values(['a','dt'], ascending=[False, True])
           .drop_duplicates('dt')
           .drop('a', axis=1)
           .sort_index())
    print (df)
        sale_id          dt  receipts_qty
    31    196.0  2017-02-19          95.0
    32    203.0  2017-02-20         101.0
    33    196.0  2017-02-21         105.0
    34    196.0  2017-02-22         112.0
    35    196.0  2017-02-23         118.0
    36    196.0  2017-02-24         135.0
    37    196.0  2017-02-25         135.0
    38    196.0  2017-02-26         124.0
    39    196.0  2017-02-27          84.0
    41    196.0  2017-02-28         124.0
    43    196.0  2017-03-01         100.0
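
    If you would rather avoid the helper column, an equivalent approach is a boolean mask: keep every sale_id == 196 row, plus rows whose dt has no 196 entry at all, then drop_duplicates in case several non-196 rows share a date. This is a sketch on a reduced copy of the example data, not the answer's original method:

    ```python
    import pandas as pd

    # rebuild a small slice of the example data
    df = pd.DataFrame({
        'sale_id': [196.0, 203.0, 203.0, 196.0, 203.0, 196.0],
        'dt': ['2017-02-19', '2017-02-20', '2017-02-27',
               '2017-02-27', '2017-02-28', '2017-02-28'],
    }, index=[31, 32, 40, 39, 42, 41])

    # dates that have a sale_id == 196 row
    has_196 = df.loc[df.sale_id == 196, 'dt']
    # keep 196 rows, plus rows on dates with no 196 entry (e.g. 2017-02-20)
    mask = (df.sale_id == 196) | ~df.dt.isin(has_196)
    out = df[mask].drop_duplicates('dt').sort_index()
    print(out)
    ```

    On the full DataFrame this yields the same result as the helper-column version, without mutating df.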