Suppose you have two lists of strings containing similar items, with changes (eg. List 1: Apples,fruits_b,orange; List2: Fruit,apples,banana,orange_juice).
Given a distance metric such as the Levenshtein distance, what are good algorithms for finding the optimal pairing, that is the pairing that minimizes the sum of distances for all pairings?
The result corresponding to my example would be:
Apples - apples
fruits_b - Fruit
orange - orange_juice
- banana
Subsidiary question: is there some tool that already implements this or something similar?
OK, here's my python solution using the Levenshtein distance and the Hungarian algorithm (both provided by external packages):
from munkres import Munkres
from Levenshtein import distance
from sys import argv
if __name__ == '__main__':
if len(argv) < 3:
print("Usage: fuzzy_match.py file file")
print("Finds the best pairing of lines from the two input files")
print("using the Levenshtein distance and the Hungarian algorithm")
w1 = [l.strip() for l in open(argv[1]).readlines()]
w2 = [l.strip() for l in open(argv[2]).readlines()]
if len(w1) != len(w2):
if len(w2) > len(w1):
w1, w2 = w2, w1
w2.extend([""]*(len(w1)-len(w2)))
matrix = []
for i in w1:
row = []
for j in w2:
row.append(distance(i.lower(), j.lower()))
matrix.append(row)
m = Munkres()
max_length = max(len(w) for w in w1)
for i, j in m.compute(matrix):
print(("{:<%d}{}" % (max_length+10)).format(w1[i], w2[j]))
It works pretty nicely. I'm still curious if anyone can come up with a better algorithm, though!