Simple set-up: I have a list (roughly 40,000 entries) containing lists of strings (each with 2-15 elements). I want to compare all of the sublists to check if they have a common element (they share at most one). At the end, I want to create a dictionary (graph if you wish) where the index of each sublist is used as a key, and its values are the indices of the other sublists with which it shares common elements.
For example
lst = [['dam', 'aam','adm', 'ada', 'adam'], ['va','ea','ev','eva'], ['va','aa','av','ava']]
should give the following:
dic = {0: [], 1: [2], 2: [1]}
My problem is that I found a solution, but it's very computationally expensive. First, I wrote a function to compute the intersection of two lists:
def intersection(lst1, lst2):
temp = set(lst2)
lst3 = [value for value in lst1 if value in temp]
return lst3
Then I would loop over all the lists to check for intersections:
dic = {}
iter_range = range(len(lst))
#loop over all lists where k != i
for i in iter_range:
#create range that doesn't contain i
new_range = list(iter_range)
new_range.remove(i)
lst = []
for k in new_range:
#check if the lists at position i and k intersect
if len(intersection(mod_names[i], mod_names[k])) > 0:
lst.append(k)
# fill dictionary
dic[i] = lst
I know that for loops are slow, and that I'm looping over the list unnecessarily often (in the above example, I compare 1 with 2, then 2 with 1), but I don't know how to change it to make the program run faster.
You can create a dict word_occurs_in
which will store data which word occurs in which lists, for your sample that would be:
{'dam': [0], 'aam': [0], 'adm': [0], 'ada': [0], 'adam': [0], 'va': [1, 2], 'ea': [1], 'ev': [1], 'eva': [1], 'aa': [2], 'av': [2], 'ava': [2]}
Then you can create a new dict, let's call it result
, in which you should store the final result, e.g. {0: [], 1: [2], 2: [1]}
in your case.
Now, to get result
from word_occurs_in
, you should traverse the values of word_occurs_in
and see if the list has more then one element. If it does, then you just need add all other values except the value of the currently observed key in result
. For instance, when checking the value [1, 2]
(for key 'va'
), you' will add 1
to the value corresponding to 2
in the result
dict and will add 2
to the value corresponding to key 1
. I hope this helps.
In my understanding, the biggest complexity to your code comes from iterating the list of 40K entries twice, so this approach iterates the list only once, but uses a bit more space.
Maybe I didn't explain myself sufficiently, so here is the code:
from collections import defaultdict
lst = [['dam', 'aam', 'adm', 'ada', 'adam'], ['va', 'ea', 'ev', 'eva'], ['va', 'aa', 'av', 'ava']]
word_occurs_in = defaultdict(list)
for idx, l in enumerate(lst):
for i in l:
word_occurs_in[i].append(idx)
print(word_occurs_in)
result = defaultdict(list)
for v in word_occurs_in.values():
if len(v) > 1:
for j in v:
result[j].extend([k for k in v if k != j])
print(result)