
String matching algorithm with more weight given to more unique words?


I've been looking into Python and using the recordlinkage toolkit for address matching. I've found that string-matching algorithms such as Levenshtein return false matches for very common addresses. Ideally, an address sharing one very unique word would be scored more highly than one sharing only very common words, e.g. "12 Pellican street" and "12 Pellican road" is a better match than "20 Main Street" and "34 Main Street".

Is there a method for incorporating weighted string matching, so that rarer words carry more importance when scoring a match?


Solution

  • I've found that using the q-gram distance rather than the Levenshtein distance takes into consideration the frequency of the string in the dataset.
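
Beyond swapping the distance function, one way to make rare words count for more is an IDF-weighted token overlap: each word is weighted by the inverse of how often it appears across the dataset, so "pellican" outweighs "street". The sketch below is a minimal illustration, not part of the recordlinkage toolkit; the function names (`idf_weights`, `weighted_overlap`) and the tiny address corpus are made up for this example.

```python
import math
from collections import Counter

def idf_weights(addresses):
    """Inverse-document-frequency weight per token: rare tokens
    (e.g. 'pellican') get high weight, common ones ('street') low."""
    docs = [set(a.lower().split()) for a in addresses]
    df = Counter(tok for doc in docs for tok in doc)
    n = len(docs)
    return {tok: math.log(n / df[tok]) for tok in df}

def weighted_overlap(a, b, weights):
    """IDF-weighted Jaccard similarity between two addresses."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    inter = sum(weights.get(t, 0.0) for t in ta & tb)
    union = sum(weights.get(t, 0.0) for t in ta | tb)
    return inter / union if union else 0.0

# Hypothetical mini-dataset for illustration only.
addresses = [
    "12 pellican street", "12 pellican road",
    "20 main street", "34 main street",
    "7 main road", "9 high street",
]
w = idf_weights(addresses)

# The pair sharing the rare token 'pellican' scores higher than the
# pair sharing only the common tokens 'main' and 'street'.
print(weighted_overlap("12 pellican street", "12 pellican road", w))
print(weighted_overlap("20 main street", "34 main street", w))
```

The same idea underlies TF-IDF-weighted cosine comparison, which scales better on large datasets than pairwise edit distances.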