Search code examples
stringalgorithmlevenshtein-distance

Algorithm for fuzzy pairing of strings from two lists


Suppose you have two lists of strings containing similar items, with changes (eg. List 1: Apples,fruits_b,orange; List2: Fruit,apples,banana,orange_juice).

Given a distance metric such as the Levenshtein distance, what are good algorithms for finding the optimal pairing, that is the pairing that minimizes the sum of distances for all pairings?

The result corresponding to my example would be:

Apples    - apples
fruits_b  - Fruit
orange    - orange_juice
          - banana

Subsidiary question: is there some tool that already implements this or something similar?


Solution

  • OK, here's my python solution using the Levenshtein distance and the Hungarian algorithm (both provided by external packages):

    from munkres import Munkres
    from Levenshtein import distance
    from sys import argv
    
    if __name__ == '__main__':
        if len(argv) < 3:
            print("Usage: fuzzy_match.py file file")
            print("Finds the best pairing of lines from the two input files")
            print("using the Levenshtein distance and the Hungarian algorithm")
        w1 = [l.strip() for l in open(argv[1]).readlines()]
        w2 = [l.strip() for l in open(argv[2]).readlines()]
        if len(w1) != len(w2):
            if len(w2) > len(w1):
                w1, w2 = w2, w1
            w2.extend([""]*(len(w1)-len(w2)))
        matrix = []
        for i in w1: 
            row = []
            for j in w2:
                row.append(distance(i.lower(), j.lower()))
            matrix.append(row)
        m = Munkres()
        max_length = max(len(w) for w in w1)
        for i, j in m.compute(matrix):
            print(("{:<%d}{}" % (max_length+10)).format(w1[i], w2[j]))
    

    It works pretty nicely. I'm still curious if anyone can come up with a better algorithm, though!