Tags: python, list, optimization, mathematical-optimization, enumeration

Efficient deduplication in Python


I have written a small piece of code that assigns a score to each element of a list. To do this, I do the following (simplified code):

group={1:["Jack", "Jones", "Mike"],
       2:["Leo", "Theo", "Jones", "Leo"],
       3:["Tom", "Jack"]}

already_chose=["Tom","Mike"]
result=[]

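for group_id in group:
    name_list = group[group_id]
    y = 0  # names already chosen (every occurrence counts)
    x = 0  # distinct names not yet chosen
    repeat = []  # unchosen names already counted in this group
    for name in name_list:
        if name in already_chose:
            y += 1
        elif name not in repeat:
            x += 1
            repeat.append(name)
    score_group = x - y
    result.append([group_id, score_group])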

output: [[1, 1], [2, 3], [3, 0]]

The issue is that this code is not optimized for a large enumeration (more than 7000 groups and 100 names per group).

I hope someone can help me. Thanks a lot.


Solution

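  • IIUC, you want the number of unique names that are not in already_chose minus the number of unique names that are in already_chose. Note that your loop counts a repeated chosen name every time it appears, while a set counts it once; the two agree on your example.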

    This is easily achieved with Python sets and a list comprehension. The advantage of sets is that membership tests and set operations are fast on average, because the elements are hashed.

    [[k, len(set(v).difference(already_chose))-len(set(v).intersection(already_chose))]
     for k,v in group.items()]
    

    output: [[1, 1], [2, 3], [3, 0]]

    NB. This might be more useful as a dictionary comprehension:

    {k: len(set(v).difference(already_chose))-len(set(v).intersection(already_chose))
     for k,v in group.items()}
    

    output: {1: 1, 2: 3, 3: 0}
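
For many groups, it may help to build the set of chosen names once and reuse it, rather than rebuilding set(v) twice per group. The sketch below is only an illustration: the function names score_unique and score_like_loop are hypothetical, and only group and already_chose come from the question. The first function follows the set-based idea above; the second keeps the counting of the question's loop, where a chosen name that appears twice in a group is subtracted twice.

```python
# Hypothetical helper names; only `group` and `already_chose` come from the question.

def score_unique(group, already_chose):
    """Set-based score: unique unchosen names minus unique chosen names."""
    chosen = set(already_chose)  # build the lookup set once, not per group
    scores = {}
    for group_id, names in group.items():
        unique_names = set(names)
        n_chosen = len(unique_names & chosen)
        scores[group_id] = (len(unique_names) - n_chosen) - n_chosen
    return scores


def score_like_loop(group, already_chose):
    """Same counting as the question's loop: a repeated chosen name counts each time."""
    chosen = set(already_chose)
    scores = {}
    for group_id, names in group.items():
        y = sum(1 for name in names if name in chosen)  # every occurrence
        x = len(set(names) - chosen)  # unique names not yet chosen
        scores[group_id] = x - y
    return scores


group = {1: ["Jack", "Jones", "Mike"],
         2: ["Leo", "Theo", "Jones", "Leo"],
         3: ["Tom", "Jack"]}
already_chose = ["Tom", "Mike"]

print(score_unique(group, already_chose))     # {1: 1, 2: 3, 3: 0}
print(score_like_loop(group, already_chose))  # {1: 1, 2: 3, 3: 0}

# The two differ only when a chosen name repeats inside one group:
dup = {4: ["Tom", "Tom", "Jack"]}
print(score_unique(dup, already_chose))     # {4: 0}
print(score_like_loop(dup, already_chose))  # {4: -1}
```

Which variant is right depends on whether repeated chosen names should be penalised once or every time; both do one pass per group with average constant-time lookups in the set.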