Search code examples
pythonlistfunctioncomparelevenshtein-distance

Compare multiple Python lists and merge on Levenshtein similarity


I have written a Python function that takes two lists, compares them using Levenshtein and merges words that are similar enough, into a list called 'merged'.

How can I do this for 6+ lists? Making sure that each list is compared to the 5 other lists and so on?

first_list = ["Mouse", "Cat", "Dog", "Gremlinge", "Horse"]
second_list = ["Mouse", "Cat", "Hors", "Dog", "Gremling"]
third_list = ["Mouse", "Cat", "Horrs", "Dog", "Greemling"]
fourth_list = ["Mouse", "Cate", "Dog", "Gremlinge", "Horse"]
fifth_list = ["Mose", "Cat", "Hors", "Dog", "Gremling"]
sixth_list = ["Mouse", "Cat", "Horser", "Doeg", "Gremling"]

def lev_merging(a, b): # function to compare 2 lists
  merged = [] # Empty list to add the matching words
  for first in a:
    for second in b:
      if levenshtein(first, second) < 2:
        merged.append(set([first,second]))
  return merged

print (lev_merging(first_list,second_list))

Working www.repl.it fiddle of code.


Solution

  • We'll have a list of lists of strings

    list_of_lists = [["Mouse", "Cat", "Dog", "Gremlinge", "Horse"],
                      ["Mouse", "Cat", "Hors", "Dog", "Gremling"],
                      ["Mouse", "Cat", "Horrs", "Dog", "Greemling"],
                      ["Mouse", "Cate", "Dog", "Gremlinge", "Horse"],
                      ["Mose", "Cat", "Hors", "Dog", "Gremling"],
                      ["Mouse", "Cat", "Horser", "Doeg", "Gremling"]]
    

    Then we'll iterate through this list, keeping track of the index of the list we are "in", and compare this list to all the lists that come after it.

    def merging(list_of_lists):
        merged = []
        for i, a in enumerate(list_of_lists):
            for b in list_of_lists[i+1:]:
                for first in a:
                    for second in b:
                        if lev(first, second) < 2:
                            merged.append((first, second))
        return merged
    

    EDIT: The below code passes pairs of lists into a function, and separates them into groups. Then we will process each of those groups into sets, to remove duplicates.

    target_num_words = 6
    target_num_words
    
    def merging(list_of_lists):
        groups = []
        for i, a in enumerate(list_of_lists):
            for b in list_of_lists[i+1:]:
                if number_of_matches(a, b) >= target_num_words:
                    for g in groups:
                        if a in g or b in g:
                            g.append(a if b in g else b)
                            break
                    else:
                        groups.append([a, b])
        merged = []
        for g in groups:
            if len(g) >= target_num_lists:
                merged.append({x for l in g for x in l})
        return merged
    

    number_of_matches is basically your Levenshtein code, except it just returns that number of matching words there are between two lists. Even if this isn't exactly what you want, this should give you some idea of how to get there.