
Filter dataframe by removing duplicates from column containing list pandas


The dataframe columns contain lists of strings. The dataframe needs to be filtered so that each list in the 'Final' column appears only once.

I have a dataframe as below:

    string1           string2           Final
1   [abc,ncx]       [qwe, rty]        [apple, mango]
2   [uio,pas,dfg]   [zxc,vbg,dfv]     [banana,grapes, apple]
3   [ncx,abc]       [rty,qwe]         [mango,apple]
4   [uio,pas,dfg]   [zxc,vbg,dfv]     [banana,grapes, apple]
5   [uio,dfg]        [zxc,dfv]        [banana, apple]
6   [ncx,abc]       [rty,qwe]         [mango,apple]

Rows whose list in df['Final'] duplicates an earlier row must be dropped, so that the dataframe contains only unique lists in the 'Final' column.

Desired output dataframe:

     string1           string2           Final
1   [abc,ncx]       [qwe, rty]        [apple, mango]
2   [uio,pas,dfg]   [zxc,vbg,dfv]     [banana,grapes, apple]
3   [ncx,abc]       [rty,qwe]         [mango,apple]
4   [uio,dfg]        [zxc,dfv]        [banana, apple]

Solution

  • Build a mask with Series.duplicated and invert it with ~. Because lists are not hashable, first convert them to tuples, then filter with boolean indexing:

    df = df[~df['Final'].apply(tuple).duplicated()]
    print (df)
             string1        string2                    Final
    1      [abc,ncx]      [qwe,rty]           [apple, mango]
    2  [uio,pas,dfg]  [zxc,vbg,dfv]  [banana, grapes, apple]
    3      [ncx,abc]      [rty,qwe]           [mango, apple]
    5      [uio,dfg]      [zxc,dfv]          [banana, apple]
    

    If [apple, mango] should count as a duplicate of [mango, apple] (order is not important), change tuple to frozenset:

    df = df[~df['Final'].apply(frozenset).duplicated()]
    print (df)
             string1        string2                    Final
    1      [abc,ncx]      [qwe,rty]           [apple, mango]
    2  [uio,pas,dfg]  [zxc,vbg,dfv]  [banana, grapes, apple]
    5      [uio,dfg]      [zxc,dfv]          [banana, apple]
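
For anyone who wants to run both variants end to end, here is a minimal self-contained sketch. It assumes the cells hold real Python lists of strings (not string representations of lists) and rebuilds the sample frame from the question. The names ordered, unordered, sorted_key and unordered_multiset are illustrative only. The last variant is not part of the answer above: frozenset also ignores repeated elements inside a list, so a sorted tuple is shown as one possible alternative when [apple, apple] and [apple] should stay distinct.

```python
import pandas as pd

# Sample data rebuilt from the question; every cell is a list of strings.
df = pd.DataFrame(
    {
        "string1": [["abc", "ncx"], ["uio", "pas", "dfg"], ["ncx", "abc"],
                    ["uio", "pas", "dfg"], ["uio", "dfg"], ["ncx", "abc"]],
        "string2": [["qwe", "rty"], ["zxc", "vbg", "dfv"], ["rty", "qwe"],
                    ["zxc", "vbg", "dfv"], ["zxc", "dfv"], ["rty", "qwe"]],
        "Final": [["apple", "mango"], ["banana", "grapes", "apple"],
                  ["mango", "apple"], ["banana", "grapes", "apple"],
                  ["banana", "apple"], ["mango", "apple"]],
    },
    index=range(1, 7),
)

# Order-sensitive: tuples are hashable, so duplicated() can compare them.
ordered = df[~df["Final"].apply(tuple).duplicated()]

# Order-insensitive: frozenset ignores element order (and repeated elements).
unordered = df[~df["Final"].apply(frozenset).duplicated()]

# Assumed alternative: order-insensitive, but repeated elements still count.
sorted_key = df["Final"].apply(lambda items: tuple(sorted(items)))
unordered_multiset = df[~sorted_key.duplicated()]

print(ordered.index.tolist())    # [1, 2, 3, 5]
print(unordered.index.tolist())  # [1, 2, 5]
```

The filtered frames keep their original index labels (1, 2, 3, 5 rather than 1 to 4). If consecutive labels are wanted, chaining .reset_index(drop=True) renumbers the rows starting from 0.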