Tags: algorithm, fuzzy-search

Finding similar names better than O(n^2)


Say I've got a list of names. Unfortunately, there are some duplicates, but it isn't obvious which of those are duplicates.

Tom Riddle
Tom M. Riddle
tom riddle
Tom Riddle, PhD.

I'm thinking of using Levenshtein distance, and there are certainly other algorithms that come to mind for comparing two names at a time.

But in a list of names, regardless of the string-distance algorithm, I'll always end up generating a full grid of pairwise comparisons (n^2).

How can I avoid the O(n^2) situation?


Solution

  • Introduction

    What you want to do is known as fuzzy search. Let me guide you through the topic.

    First, set up an inverted index (Wikipedia) of n-grams (Wikipedia). That is, split a word like "hello" into 3-grams, for example:

    "$$h", "$he", "hel", "ell", "llo", "lo$", "o$$"
    

    And have a map which maps every n-gram to a list of words that contain it:

    "$$h" -> ["hello", "helloworld", "hi", "huhu", "hey"]
    "$he" -> ["hello", "helloworld", "hey"]
    ...
    "llo" -> ["hello", "helloworld", "llowaddup", "allo"]
    ...
    

    All words in your database are now indexed by their n-grams; this is why it is called an inverted index.
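    As a sketch, the padding and index construction above could look like this in Java (the class and method names are illustrative, not from a library):

```java
import java.util.*;

class NGramIndex {
    // Pad with n-1 '$' on each side, so a word of length m yields m + n - 1 grams.
    static List<String> nGrams(String word, int n) {
        String pad = "$".repeat(n - 1);
        String padded = pad + word + pad;
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    // Map every n-gram to the sorted list of words containing it.
    static Map<String, List<String>> build(Collection<String> words, int n) {
        Map<String, SortedSet<String>> index = new HashMap<>();
        for (String word : words) {
            for (String gram : nGrams(word, n)) {
                index.computeIfAbsent(gram, g -> new TreeSet<>()).add(word);
            }
        }
        Map<String, List<String>> result = new HashMap<>();
        index.forEach((gram, set) -> result.put(gram, new ArrayList<>(set)));
        return result;
    }
}
```

    Using a TreeSet while building gives us the sorted lists for free, which the merge step later relies on.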

    The idea is, given a query, to compute how many n-grams the query has in common with every word in your database. This can be computed quickly. You can then use those counts to skip the expensive edit-distance computation for the vast majority of records, which dramatically increases speed. This is, more or less, the standard approach all search engines use.

    Let me first explain the general approach by the example of an exact match. After that we will slightly modify it and get to the fuzzy matching.


    Exact match

    At query time, compute the n-grams of your query, fetch the lists and compute the intersection.

    For example, for the query "hello" you compute the grams and get:

    "$$h", "$he", "hel", "ell", "llo", "lo$", "o$$"
    

    You fetch all lists for all of those n-grams:

    List<String> result = null;
    for (String nGram : query.getNGrams()) {
        List<String> words = map.get(nGram);           // postings list for this gram
        if (result == null) result = new ArrayList<>(words);
        else result.retainAll(words);                  // set intersection
    }
    

    The intersection contains all words that match exactly those grams; here, that is only "hello".

    Note that an exact match can be computed faster by using hashing, like a HashSet.
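    As a sketch, that hash-based exact lookup could look like this (names illustrative):

```java
import java.util.*;

class ExactLookup {
    public static void main(String[] args) {
        // Hash every word once; exact lookups are then O(1) on average,
        // with no n-gram machinery needed at all.
        Set<String> dictionary = new HashSet<>(List.of("hello", "helloworld", "hey"));
        System.out.println(dictionary.contains("hello")); // true
        System.out.println(dictionary.contains("helo"));  // false
    }
}
```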


    Fuzzy match

    Instead of intersecting the lists, merge them. To merge efficiently, use a k-way merge algorithm; this requires the word lists in your inverted index to be sorted, so make sure to sort them at construction time.

    You now get a list of all words that have at least one n-gram in common with the query.

    We have already greatly reduced the set of possible records, but we can do even better. Maintain, for each word, a count of how many n-grams it has in common with the query. You can easily do that while merging the lists.
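    A sketch of the counting step (a real implementation would k-way merge the sorted postings lists; this map-based version, with illustrative names, produces the same counts):

```java
import java.util.*;

class GramCounter {
    // Count, for every candidate word, how many of the query's distinct
    // n-grams it shares. A production system would k-way merge the sorted
    // postings lists; this HashMap-based sketch yields the same counts.
    static Map<String, Integer> commonGramCounts(
            List<String> queryGrams, Map<String, List<String>> index) {
        Map<String, Integer> counts = new HashMap<>();
        for (String gram : new LinkedHashSet<>(queryGrams)) { // dedupe repeated grams
            for (String word : index.getOrDefault(gram, List.of())) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```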

    Consider the following threshold:

    max(|x|, |y|) - 1 - (delta - 1) * n
    

    where x is your query and y the candidate word you are comparing against. n is the n-gram size you used (3 for 3-grams, for example), and delta is the number of mistakes (edits) you allow.

    If the count is below that value, you directly know that the edit distance is

    ED(x, y) > delta
    

    So you only need to consider words whose count is at or above the threshold. Only for those words do you compute the edit distance ED(x, y).
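    The pruning rule can be sketched like this (names illustrative):

```java
class GramThreshold {
    // Minimum number of shared n-grams that any word y with ED(x, y) <= delta
    // must have with the query x. Words below this count can be discarded
    // without ever computing the edit distance.
    static int threshold(String x, String y, int n, int delta) {
        return Math.max(x.length(), y.length()) - 1 - (delta - 1) * n;
    }
}
```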

    This drastically reduces the set of candidates, so we only compute the expensive edit distance on a small number of records.


    Example

    Suppose you get the query "hilari". Let's use 3-grams. We get

    "$$h", "$hi", "hil", "ila", "lar", "ari", "ri$", "i$$"
    

    We search our inverted index, merge the lists of words that share those grams, and get "hillary", "haemophilia" and "solar". Along the way, we counted how many grams each word has in common with the query:

    "hillary"      -> 4 ("$$h", "hi", "hil", "lar")
    "haemophilia"  -> 2 ("$$h", "hil")
    "solar"        -> 1 ("lar")
    

    Check each entry against the threshold. Let delta be 2. We get:

    4 >= max(|"hilari"|, |"hillary"|) - 4     = 3
    2 <  max(|"hilari"|, |"haemophilia"|) - 4 = 6
    1 <  max(|"hilari"|, |"solar"|) - 4       = 2
    

    Only "hillary" is above the threshold, discard the rest. Compute the edit distance for all remaining records:

    ED("hilari", "hillary") = 2
    

    This does not exceed delta = 2, so we accept "hillary" as a fuzzy match.
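    Putting the pieces together, here is a self-contained sketch of the whole pipeline on this example (all names illustrative; the edit distance is plain Levenshtein):

```java
import java.util.*;

class FuzzySearch {
    // Pad with n-1 '$' on each side and emit all n-character windows.
    static List<String> nGrams(String word, int n) {
        String pad = "$".repeat(n - 1);
        String s = pad + word + pad;
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) grams.add(s.substring(i, i + n));
        return grams;
    }

    // Plain Levenshtein distance via dynamic programming.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            }
        }
        return d[a.length()][b.length()];
    }

    // Index, count shared grams, prune by the threshold, verify with ED.
    static List<String> search(String query, List<String> words, int n, int delta) {
        Map<String, List<String>> index = new HashMap<>();
        for (String word : words)
            for (String gram : new LinkedHashSet<>(nGrams(word, n)))
                index.computeIfAbsent(gram, g -> new ArrayList<>()).add(word);

        Map<String, Integer> counts = new HashMap<>();
        for (String gram : new LinkedHashSet<>(nGrams(query, n)))
            for (String word : index.getOrDefault(gram, List.of()))
                counts.merge(word, 1, Integer::sum);

        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int threshold = Math.max(query.length(), e.getKey().length())
                    - 1 - (delta - 1) * n;
            if (e.getValue() >= threshold                        // cheap prune
                    && editDistance(query, e.getKey()) <= delta) // expensive check
                matches.add(e.getKey());
        }
        return matches;
    }
}
```

    On the example data, only "hillary" survives the threshold, and its edit distance of 2 confirms it as a match.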