string algorithm performance match fuzzy-comparison

Fast way to match strings with typo

I have a huge list of strings (city-names) and I want to find the name of a city even if the user makes a typo.

Example

User types "chcago" and the system finds "Chicago"

Of course I could calculate the Levenshtein distance of the query for all strings in the list but that would be horribly slow.

Is there any efficient way to perform this kind of string-matching?

Solution

I think the basic idea is to use Levenshtein distance, but on a subset of the names. One approach that works if the names are long enough is to use n-grams. You can store n-grams and then use more efficient techniques to say that at least x n-grams need to match. Alas, your example misspelling has 2-matching 3-grams with Chicago out of 5 (unless you count partials at the beginning and end).

For shorter names, another approach is to store the letters in each name. So, "Chicago" would turn into 6 "tuples": "c", "h", "i", "a", "g", "o". You would do the same for the name entered and then require that 4 or 5 match. This is a fairly simple match operation, so it can go quite fast.

Then, on this reduced set, apply Levenshtein distance to determine what the closest match is.