Tags: python, pandas, list, dataframe, nested-lists

How to remove duplicates from lists of lists stored in a pandas data frame


I have the data frame below. I want to compare two columns that each hold a list of lists, remove the duplicates, and then combine both into one column. I am trying the logic below, but it throws the error "TypeError: unhashable type: 'list'".

Data frame:

import pandas as pd

df = pd.DataFrame({'col1': [[[1452, 5099], [1418, 499]], [[1427, 55099]]],
                   'col2': [[[1452, 5099], [1417, 490]], [[1317, 55010]]]})
df
         col1                                    col2
0   [[1452, 5099], [1418, 499]]       [[1452, 5099], [1417, 490]]
1   [[1427, 55099]]                   [[1317, 55010]]

res =  [list(set(l1).union(l2) - set(l1).intersection(l2)) for l1, l2 in zip(df['col1'].tolist(), df['col2'].tolist())]
print(res)

Error:

TypeError: unhashable type: 'list'

Expected output:

res = [[[1452, 5099], [1418, 499],[1417, 490]], [[1427, 55099],[1317, 55010]]]
df['result']=res
print(df)
            col1                                  col2                   result
    0   [[1452, 5099], [1418, 499]]   [[1452, 5099], [1417, 490]]    [[1452, 5099], [1418, 499], [1417, 490]]
    1   [[1427, 55099]]               [[1317, 55010]]                [[1427, 55099], [1317, 55010]]

Solution

  • Lists are mutable and therefore unhashable, so they cannot be placed in a set. You need to temporarily convert each inner list to a tuple, which is hashable.

    The cleanest approach is probably a helper function:

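    def merge(list_of_lists):
        seen = set()  # tuples of the items already kept
        out = []
        for lst in list_of_lists:
            for item in lst:
                t = tuple(item)  # hashable stand-in for the inner list
                if t not in seen:
                    out.append(item)
                    seen.add(t)
        return out

    # zip pairs up the two columns row by row
    df['result'] = [merge(pair) for pair in zip(df['col1'], df['col2'])]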
    

    A hackier, less readable alternative is to use an intermediate dictionary as the container, with the tuples as keys and the original lists as values:

    df['result'] = [list({tuple(x): x for l in lst for x in l}.values())
                    for lst in zip(df['col1'], df['col2'])]
    

    Output:

                              col1                         col2                                    result
    0  [[1452, 5099], [1418, 499]]  [[1452, 5099], [1417, 490]]  [[1452, 5099], [1418, 499], [1417, 490]]
    1              [[1427, 55099]]              [[1317, 55010]]            [[1427, 55099], [1317, 55010]]
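
To see why the tuple conversion is needed, here is a minimal sketch that reproduces the error and then applies the same idea as the helper above. It uses plain Python lists in place of the two columns so that it runs without pandas. The names `col1`, `col2`, `merge_unique`, `error`, and `as_tuples` are illustrative and not part of the original answer.

```python
# Plain lists standing in for df['col1'] and df['col2'] (assumption for illustration)
col1 = [[[1452, 5099], [1418, 499]], [[1427, 55099]]]
col2 = [[[1452, 5099], [1417, 490]], [[1317, 55010]]]

# A set needs hashable members; lists are mutable and define no hash,
# so building a set from a list of lists raises TypeError.
try:
    set(col1[0])
    error = None
except TypeError as exc:
    error = str(exc)

# Tuples of ints are hashable, so they work as set members or dict keys.
as_tuples = {tuple(pair) for pair in col1[0]}


def merge_unique(*lists_of_pairs):
    """Concatenate the inputs, keeping the first occurrence of each pair."""
    seen = set()
    out = []
    for pairs in lists_of_pairs:
        for pair in pairs:
            key = tuple(pair)  # hashable key; the original list is what gets stored
            if key not in seen:
                seen.add(key)
                out.append(pair)
    return out


result = [merge_unique(a, b) for a, b in zip(col1, col2)]
print(result)
```

With the data frame from the question, the same comprehension should work with `zip(df['col1'], df['col2'])`, and the result can be assigned to `df['result']`. Note also that the expression in the question, union minus intersection, is a symmetric difference: even with hashable items it would drop `[1452, 5099]`, which appears in both columns, instead of keeping one copy of it.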