Tags: python, dataframe, duplicates, distinct-values

Remove duplicate values in a pandas record


I want to remove the duplicate values within each row of the animals column.

I need something like this post, but in Python. I cannot figure this out right now and have hit a block.

Remove duplicate records in dataframe

I have tried drop_duplicates, unique, nunique, etc., with no luck.

df.drop_duplicates(subset=None, keep="first", inplace=False)
df


import pandas as pd
df = pd.DataFrame({'animals': ['pink pig, pink pig, pink pig', 'brown cow, brown cow', 'pink pig, black cow', 'brown horse, pink pig, brown cow, black cow, brown cow']})

#input:
    animals
0   pink pig, pink pig, pink pig
1   brown cow, brown cow
2   pink pig, black cow
3   brown horse, pink pig, brown cow, black cow, brown cow

#I would like the output to look like this:
    animals
0   pink pig
1   brown cow
2   pink pig, black cow
3   brown horse, pink pig, brown cow, black cow


Solution

  • This does it:

    import pandas as pd

    df = pd.DataFrame({'animals': ['pink pig, pink pig, pink pig', 'brown cow, brown cow', 'pink pig, black cow', 'brown horse, pink pig, brown cow, black cow, brown cow']})

    # Split each cell on ', ', drop duplicates with a set, then join back into one string
    df['animals2'] = df.animals.apply(lambda x: ', '.join(list(set(x.split(', ')))))
    

    Output:

    0                                       pink pig
    1                                      brown cow
    2                            pink pig, black cow
    3    brown cow, brown horse, pink pig, black cow
    

    Explanation:

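    I split each of your strings into a list on ', '. Then I turned the list into a set to remove duplicates. Then I turned the set back into a list and joined the list into a string again. Note that a set does not keep the original order, so the items in a row can come out in a different order than in the input (see row 3). Please tell me if something isn't clear!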
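
If the original order of the items matters, as in the desired output in the question, one possible variation is to deduplicate with dict.fromkeys instead of a set. Dictionaries keep their keys in insertion order (guaranteed since Python 3.7), so the first occurrence of each item stays in place. This is only a minimal sketch, assuming the items are always separated by ', '; the helper name dedupe_items and the column name animals_dedup are made up for illustration and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'animals': [
    'pink pig, pink pig, pink pig',
    'brown cow, brown cow',
    'pink pig, black cow',
    'brown horse, pink pig, brown cow, black cow, brown cow',
]})


def dedupe_items(cell, sep=', '):
    """Drop repeated items from a delimited string, keeping first occurrences in order."""
    # dict.fromkeys keeps one key per distinct item, in insertion order
    return sep.join(dict.fromkeys(cell.split(sep)))


# Hypothetical column name, chosen only for this example
df['animals_dedup'] = df['animals'].apply(dedupe_items)
print(df['animals_dedup'].tolist())
```

With the sample data, row 3 comes out as 'brown horse, pink pig, brown cow, black cow', which matches the order requested in the question.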