Search code examples
regexdistancelevenshtein-distance

Is it possible to calucate the edit distance between a regexp and a string?


If so, please explain how.

Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."

For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ and xyz.

If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.

How can you find the shortest distance between a regexp and a non-matching string?

The above is the Damerau–Levenshtein distance algorithm.


Solution

  • You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.

    If, however, you merely want the efficient, approximate searching, a less flexibly but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.