Search code examples
pythonpandasdataframeduplicatesrow

How can I delete the rest duplicate rows while keeping the first and last row based on Column A?


How can I delete the rest duplicate rows while keeping the first and last row based on Column A?

 df = pd.DataFrame({
        'Column A': [12,12,12, 15, 16, 141, 141, 141, 141],
         'Column B':['Apple' ,'Apple' ,'Apple' , 'Red', 'Blue', 'Yellow', 'Yellow', 'Yellow', 'Yellow'],
        'Column C':[100, 50, np.nan , 23 , np.nan , 199 , np.nan , 1,np.nan]
    }) 

or data table as follows:


    | Column A | Column B |Column C 
----| -------- | ---------|--------
0   | 12       | Apple    |100     
1   | 12       | Apple    |50      
2   | 12       | Apple    |NaN      
3   | 15       | Red      |23       
4   | 16       | Blue     |NaN      
5   | 141      | Yellow   |199      
6   | 141      | Yellow   |NaN      
7   | 141      | Yellow   |1        
8   | 141      | Yellow   |NaN  


The result would be:


    | Column A | Column B |Column C 
----| -------- | ---------|--------
0   | 12       | Apple    |100         
2   | 12       | Apple    |NaN      
3   | 15       | Red      |23       
4   | 16       | Blue     |NaN      
5   | 141      | Yellow   |199           
8   | 141      | Yellow   |NaN  



Solution

  • This is a possible way to achieve what you want:

    result = (
        pd.concat([
            df.drop_duplicates('Column A', keep='first'),
            df.drop_duplicates('Column A', keep='last'),
        ]).reset_index()
          .drop_duplicates('index')
          .sort_values('index')
          .set_index('index')
          .rename_axis(None)
    )
    

    Result:

       Column A Column B  Column C
    0        12    Apple     100.0
    2        12    Apple       NaN
    3        15      Red      23.0
    4        16     Blue       NaN
    5       141   Yellow     199.0
    8       141   Yellow       NaN