Tags: ruby, performance, algorithm, levenshtein-distance, fuzzy-search

Fast fuzzy/approximate search in dictionary of strings in Ruby


I have a dictionary of 50K to 100K strings (each can be up to 50+ characters) and I am trying to find whether a given string is in the dictionary within some edit-distance tolerance (Levenshtein, for example). I am fine with pre-computing any type of data structure before doing the search.

My goal is to run thousands of strings against that dictionary as fast as possible and return the closest neighbor. I would be fine with just getting a boolean that says whether a given string is in the dictionary or not, if there were a significantly faster algorithm to do so.

For this, I first tried to compute all the Levenshtein distances and take the minimum, which was obviously horribly slow. So I tried to implement a Levenshtein trie based on this article: http://stevehanov.ca/blog/index.php?id=114
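For reference, the brute-force version amounted to something like this sketch (assuming the dictionary is a plain Array of strings and using the Text gem for the distances):

require 'text'  # gem install text

# One Levenshtein computation per dictionary entry for every query:
# far too slow for 50K-100K entries and thousands of queries.
def closest_match(dictionary, word)
  dictionary.min_by { |entry| Text::Levenshtein.distance(word, entry) }
end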

See my gist here for reproducing the benchmark: https://gist.github.com/nicolasmeunier/7493947

Here are a few benchmarks I got on my machine:

Edit distance of 0 (perfect match)

Benchmark.measure { 10.times { dictionary.search(random_word, 0) } }
<Benchmark::Tms:0x007fa59bad8908 @label="", @real=0.010889, @cstime=0.0, @cutime=0.0, @stime=0.0, @utime=0.00999999999999801, @total=0.00999999999999801> 

Edit distance of 2 (it becomes a LOT slower)

Benchmark.measure { 10.times { dictionary.search(random_word, 2) } }
<Benchmark::Tms:0x007fa58c9ca778 @label="", @real=3.404604, @cstime=0.0, @cutime=0.0, @stime=0.020000000000000018, @utime=3.3900000000000006, @total=3.4100000000000006>

And it goes downhill from there, becoming extremely slow for edit distances larger than 2 (1+ second on average per tested string).

I would like to know how/if I could speed this up significantly. If there are existing solutions already implemented in Ruby gems, I don't want to reinvent the wheel either...

EDIT 1: In my case, I expect most of the strings I am matching against the dictionary NOT to be in there. So if there is any algorithm to quickly discard a string, that could really help; see the sketch below for the kind of shortcut I mean.
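To illustrate (just a sketch, with made-up names, not from any gem): the Levenshtein distance between two strings is at least the difference in their lengths, so bucketing the dictionary by length lets most entries be skipped without computing anything.

require 'text'

# Precompute once: bucket the dictionary by string length.
# by_length = dictionary.group_by(&:length)

# Quick reject: only buckets within the tolerance can possibly match,
# because Levenshtein(a, b) >= (a.length - b.length).abs.
def in_dictionary?(by_length, word, tolerance)
  (word.length - tolerance..word.length + tolerance).any? do |len|
    (by_length[len] || []).any? do |entry|
      Text::Levenshtein.distance(word, entry) <= tolerance
    end
  end
end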

Thanks, Nicolas


Solution

  • I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching. Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.

    Given both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes:

    • first ask either gem for a set of best matches trigrams-wise
    • then compare results with your input string, using Levenshtein
    • and return the min for that measure.

    In Ruby (as you asked), using Fuzzily + the Text gem, obtaining the records within the edit distance threshold would look like:

    MyRecords.find_by_fuzzy_name(input_string).select { |result|
      Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
    }
    

    This performs a handful of well-optimized database queries and a few Levenshtein distance computations in Ruby.

    Caveats:

    • if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshteins.
    • using trigrams assumes your input text is Latin script or close to it (European languages, basically).
    • there probably are edge cases, since nothing guarantees that "number of matching trigrams" is a great general approximation of "edit distance".
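
    To make that last caveat concrete, trigram matching counts shared three-character substrings, roughly like this sketch (illustrative only, not Fuzzily's actual internals):

    # Illustrative trigram extraction and overlap count
    # (not how Fuzzily is actually implemented).
    def trigrams(str)
      padded = "**#{str.downcase}*"   # pad so word boundaries yield trigrams
      (0..padded.length - 3).map { |i| padded[i, 3] }
    end

    # "kitten" vs "sitting" (edit distance 3) share only one trigram, "itt",
    # so the overlap count only loosely tracks edit distance.
    (trigrams("kitten") & trigrams("sitting")).size  # => 1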