I have written a Python function that takes two lists, compares them using Levenshtein and merges words that are similar enough, into a list called 'merged'.
How can I do this for 6+ lists? Making sure that each list is compared to the 5 other lists and so on?
first_list = ["Mouse", "Cat", "Dog", "Gremlinge", "Horse"]
second_list = ["Mouse", "Cat", "Hors", "Dog", "Gremling"]
third_list = ["Mouse", "Cat", "Horrs", "Dog", "Greemling"]
fourth_list = ["Mouse", "Cate", "Dog", "Gremlinge", "Horse"]
fifth_list = ["Mose", "Cat", "Hors", "Dog", "Gremling"]
sixth_list = ["Mouse", "Cat", "Horser", "Doeg", "Gremling"]
def lev_merging(a, b): # function to compare 2 lists
merged = [] # Empty list to add the matching words
for first in a:
for second in b:
if levenshtein(first, second) < 2:
merged.append(set([first,second]))
return merged
print (lev_merging(first_list,second_list))
We'll have a list of lists of strings
list_of_lists = [["Mouse", "Cat", "Dog", "Gremlinge", "Horse"],
["Mouse", "Cat", "Hors", "Dog", "Gremling"],
["Mouse", "Cat", "Horrs", "Dog", "Greemling"],
["Mouse", "Cate", "Dog", "Gremlinge", "Horse"],
["Mose", "Cat", "Hors", "Dog", "Gremling"],
["Mouse", "Cat", "Horser", "Doeg", "Gremling"]]
Then we'll iterate through this list, keeping track of the index of the list we are "in", and compare this list to all the lists that come after it.
def merging(list_of_lists):
merged = []
for i, a in enumerate(list_of_lists):
for b in list_of_lists[i+1:]:
for first in a:
for second in b:
if lev(first, second) < 2:
merged.append((first, second))
return merged
EDIT: The below code passes pairs of lists into a function, and separates them into groups. Then we will process each of those groups into sets, to remove duplicates.
target_num_words = 6
target_num_words
def merging(list_of_lists):
groups = []
for i, a in enumerate(list_of_lists):
for b in list_of_lists[i+1:]:
if number_of_matches(a, b) >= target_num_words:
for g in groups:
if a in g or b in g:
g.append(a if b in g else b)
break
else:
groups.append([a, b])
merged = []
for g in groups:
if len(g) >= target_num_lists:
merged.append({x for l in g for x in l})
return merged
number_of_matches
is basically your Levenshtein code, except it just returns that number of matching words there are between two lists. Even if this isn't exactly what you want, this should give you some idea of how to get there.