Efficient way to modify a dictionary while comparing its items

I have a dictionary with strings as keys and sets as values. These sets contain integers, which may be common for different items of the dictionary. Indeed, my goal is to compare each item against the rest and find those common numbers in the set values. Once located a pair of items that at least share a number in their value sets (lets say, (key1, val1) and (key2, val2)), I want to add an extra item to the dictionary whose key is the concatenation of both keys key1 and key2 (I DO NOT CARE ABOUT THE ORDER), and whose value is a set with the common numbers in val1 and val2. Additionally, the common numbers in val1 and val2 should be removed from that item, and in the eventual case those sets are left empty, that item should be removed.

Long description, example needed:

I have a dictionary like this one:

db = {"a": {1}, "b":{1}, "c":{2,3,4}, "d":{2,3}}

I am looking for a procedure that returns something like this:

{"ab": {1}, "c":{4}, "cd": {2,3}}

I actually have a solution, but I suspect is quite inefficient, especially because I start creating two lists for the keys and the values of the dictionary, in order to be able to do a nested loop. That is,

db = {"a":{1}, "b":{1}, "c":{2,3,4}, "d":{2,3}}

keys = list(db.keys())
values = list(db.values())
blackList = set() #I plan to add here the indexes to be removed
for i, key in enumerate(keys):
    if i in blackList: continue
    keysToAdd = []
    valuesToAdd = []
    for j in range(i+1,len(keys)):
        if j in blackList: continue
        
        intersect = values[i].intersection(values[j])

        if len(intersect)==0: continue
        
        keysToAdd.append(key+keys[j]) #I don't care about the order
        valuesToAdd.append(intersect)

        values[i] -= intersect
        values[j] -= intersect

        if len(values[j])==0:
            blackList.add(j)

        if len(values[i])==0:
            blackList.add(i)
            break
    #out of j loop
    keys.extend(keysToAdd)
    values.extend(valuesToAdd)

#finally delete elements in the blackList
for i in sorted(blackList, reverse=True):
    del keys[i]
    del values[i]

print(dict(zip(keys, values))) #{"ab": {1}, "c":{4}, "cd":{2,3}} #I don't care about the order

I would like to know if my approach can be improved, in order to have a more efficient solution (because I plan to have big and complex input dicts). Is it actually a way of using itertools.combinations? - Or perhaps a completely different approach to solve this problem. Thanks in advance

Solution

I think your approach is not overly inefficient, but there is at least one slightly faster way to achieve this. I think this is a good job for defaultdict.

from collections import defaultdict

def deduplicate(db: dict) -> dict:
    # Collect all keys which each number appears in by inverting the mapping:
    inverse = defaultdict(set)  # number -> keys
    for key, value in db.items():
        for number in value:
            inverse[number].add(key)

    # Invert the mapping again - this time joining the keys:
    result = defaultdict(set)  # joined key -> numbers
    for number, keys in inverse.items():
        new_key = "".join(sorted(keys))
        result[new_key].add(number)
    return dict(result)

When comparing using your example - this is roughly 6 times faster. I don't doubt there are faster ways to achieve it!

Using itertools.combinations would be a reasonable approach if for example you knew that each number could show up in at most 2 keys. As this can presumably be larger - you would need to iterate combinations of varying sizes, ending up with more iterations overall.

Note 1: I have sorted the keys before joining into a new key. You mention that you don't mind about the order - but in this solution its important that keys are consistent, to avoid both "abc" and "bac" existing with different numbers. The consistent order will also make it easier for you to test.

Note 2: the step where you remove values from the blacklist has a side effect on the input db - and so running your code twice with the same input db would produce different results. You could avoid that in your original implementation by performing a deepcopy:

values = deepcopy(list(db.values())