Search code examples
pythonarrayslistcomparehashable

Complexity reduction: Find common elements in lists


Simple set-up: I have a list (roughly 40,000 entries) containing lists of strings (each with 2-15 elements). I want to compare all of the sublists to check if they have a common element (they share at most one). At the end, I want to create a dictionary (graph if you wish) where the index of each sublist is used as a key, and its values are the indices of the other sublists with which it shares common elements.

For example

lst = [['dam', 'aam','adm', 'ada', 'adam'], ['va','ea','ev','eva'], ['va','aa','av','ava']]

should give the following:

dic = {0: [], 1: [2], 2: [1]}

My problem is that I found a solution, but it's very computationally expensive. First, I wrote a function to compute the intersection of two lists:

def intersection(lst1, lst2): 

    temp = set(lst2) 
    lst3 = [value for value in lst1 if value in temp] 
    return lst3 

Then I would loop over all the lists to check for intersections:

dic = {}
iter_range = range(len(lst))

#loop over all lists where k != i
for i in iter_range:

    #create range that doesn't contain i
    new_range = list(iter_range)
    new_range.remove(i)

    lst = []

    for k in new_range:
        #check if the lists at position i and k intersect
        if len(intersection(mod_names[i], mod_names[k])) > 0:
            lst.append(k)

    # fill dictionary 
    dic[i] = lst

I know that for loops are slow, and that I'm looping over the list unnecessarily often (in the above example, I compare 1 with 2, then 2 with 1), but I don't know how to change it to make the program run faster.


Solution

  • You can create a dict word_occurs_in which will store data which word occurs in which lists, for your sample that would be:

    {'dam': [0], 'aam': [0], 'adm': [0], 'ada': [0], 'adam': [0], 'va': [1, 2], 'ea': [1], 'ev': [1], 'eva': [1], 'aa': [2], 'av': [2], 'ava': [2]}

    Then you can create a new dict, let's call it result, in which you should store the final result, e.g. {0: [], 1: [2], 2: [1]} in your case.

    Now, to get result from word_occurs_in, you should traverse the values of word_occurs_in and see if the list has more then one element. If it does, then you just need add all other values except the value of the currently observed key in result. For instance, when checking the value [1, 2] (for key 'va'), you' will add 1 to the value corresponding to 2 in the result dict and will add 2 to the value corresponding to key 1. I hope this helps.

    In my understanding, the biggest complexity to your code comes from iterating the list of 40K entries twice, so this approach iterates the list only once, but uses a bit more space.

    Maybe I didn't explain myself sufficiently, so here is the code:

    from collections import defaultdict
    
    lst = [['dam', 'aam', 'adm', 'ada', 'adam'], ['va', 'ea', 'ev', 'eva'], ['va', 'aa', 'av', 'ava']]
    
    word_occurs_in = defaultdict(list)
    
    for idx, l in enumerate(lst):
        for i in l:
            word_occurs_in[i].append(idx)
    
    print(word_occurs_in)
    
    result = defaultdict(list)
    for v in word_occurs_in.values():
        if len(v) > 1:
            for j in v:
                result[j].extend([k for k in v if k != j])
    
    print(result)