Search code examples
pythonstringalgorithmlevenshtein-distanceedit-distance

String similarity metrics in Python


I want to find string similarity between two strings. en.wikipedia has examples of some of them. code.google has a Python implementation of Levenshtein distance.
Is there a better algorithm, (and hopefully a Python library), under these constraints:

  1. I want to do fuzzy matches between strings. eg matches('Hello, All you people', 'hello, all You peopl') should return True
  2. False negatives are acceptable, False positives, except in extremely rare cases are not.
  3. This is done in a non realtime setting, so speed is not (much) of concern.
  4. [Edit] I am comparing multi word strings.

Would something other than Levenshtein distance(or Levenshtein ratio) be a better algorithm for my case?


Solution

  • There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

    http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

    Here's a bit of the list:

    • Hamming distance
    • Levenshtein distance
    • Needleman-Wunch distance or Sellers Algorithm
    • and many more...