python pandas duplicates data-manipulation data-cleaning

Replacing duplicates within each row of a DataFrame, keeping only the last occurrence, in Python Pandas


I am miserably stuck on a Pandas data-cleaning problem and have made a simple example to demonstrate it. Within each row, I want to remove the duplicates and keep only the last occurrence. Currently my DataFrame is 'animals', and I want it to become the DataFrame 'animals_clean'.

Consider this DataFrame. There are duplicates within each row (across the columns, i.e. axis=1), e.g. 'cat' appears twice in row 0.

import pandas as pd

list_of_animals = [['cat','dog','monkey','sparrow', 'cat'],['cow', 'eagle','rat', 'eagle', 'owl'],['deer', 'horse', 'goat', 'falcon', 'falcon']]
animals = pd.DataFrame(list_of_animals)

How it looks: [image of the animals DataFrame]

This is the result I want. In each row, the earlier duplicates are marked 'X' and only the last occurrence is kept.

list_of_animals_clean = [['X','dog','monkey','sparrow', 'cat'],['cow', 'X','rat', 'eagle', 'owl'], ['deer', 'horse', 'goat', 'X', 'falcon']]
animals_clean = pd.DataFrame(list_of_animals_clean)

Should look like: [image of the animals_clean DataFrame]


Solution

  • Try apply with axis=1, so that each row is passed as a Series, then mask the values that duplicated flags with keep='last':

    import pandas as pd
    
    list_of_animals = [['cat', 'dog', 'monkey', 'sparrow', 'cat'],
                       ['cow', 'eagle', 'rat', 'eagle', 'owl'],
                       ['deer', 'horse', 'goat', 'falcon', 'falcon']]
    animals = pd.DataFrame(list_of_animals)
    
    animals = animals.apply(
        # Flag every occurrence except the last in each row, then overwrite it
        lambda s: s.mask(s.duplicated(keep='last'), 'X'),
        axis=1
    )
    
    print(animals)
    

    Output:

          0      1       2        3       4
    0     X    dog  monkey  sparrow     cat
    1   cow      X     rat    eagle     owl
    2  deer  horse    goat        X  falcon
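
To see why this works, it can help to look at a single row first. The sketch below is only an illustration of the same idea, not a different method: it uses the 'X' marker from the question, and the names row, flags, dup and result are just chosen for this example. It also shows a variant that builds the boolean frame once and calls mask on the whole DataFrame, which should give the same result as masking row by row.

```python
import pandas as pd

animals = pd.DataFrame([
    ['cat', 'dog', 'monkey', 'sparrow', 'cat'],
    ['cow', 'eagle', 'rat', 'eagle', 'owl'],
    ['deer', 'horse', 'goat', 'falcon', 'falcon'],
])

# One row, taken as a Series indexed by column label.
row = animals.loc[0]

# keep='last' flags every occurrence of a value except the final one.
flags = row.duplicated(keep='last')
print(flags.tolist())  # [True, False, False, False, False]

# mask() replaces the flagged positions and leaves the rest untouched.
print(row.mask(flags, 'X').tolist())  # ['X', 'dog', 'monkey', 'sparrow', 'cat']

# Same idea for the whole frame: compute the boolean frame row by row,
# then apply a single mask to the DataFrame.
dup = animals.apply(lambda s: s.duplicated(keep='last'), axis=1)
result = animals.mask(dup, 'X')
print(result)
```

Splitting the flagging from the masking also makes it easy to inspect dup on its own if the result is not what you expect.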