Search code examples
calgorithmsqlitesearchfuzzy-search

Detect duplicate names via fuzzy matching


I have a SQLite database which has (user_id, name). I want to detect if a user is already in the system by name. The problem is that the name is coming from a user meaning that he can misspell the name or it may be an alternate version of the name: "Tim" vs "Timothy". So I would like a function which finds the closest match to the input and gives a confidence of similarity to determine whether there is a match at all. The confidence should be between 0 and 1 (so that I can set a meaningful cutoff).

Table:

1 | Tim Best
2 | Roger Thomas
3 | Roper Bar
  • If the user enters Timothy Bert the function should return 1 | Tim Best | 0.8 (0.8 being the confidence, if that were what it happened to be).
  • If the user enters Roper Thomas the function should return 2 | Roger Thomas | 0.6
  • If the user enters Tim Taylor the function should return 1 | Tim Best | 0.3
  • If the user enters Foo Taylor the function should return 2 | Roper Thomas | 0.0

Ideally it would be best if I could write a query in SQLite to do this, but if thats not possible, I'll take a c solution as well.


Solution

  • There are several attempts to solve fuzzy string matching. Google tells you a lot and so does wikipedia. The most popular is Levenshtein. Other interesting approaches are Jaro-Winler and Trigram matching.

    My personal experience says that you have to play around with the algorithms which exist. I had a problem to match "FirstName LastName" vs. "LastName, FirstName" and the only algorithm suitable for my needs was a modified Trigram which I developed from the links provided.

    For your needs, you should also keep a dictionary of Name abbrevations so that you can convert each short form to its basic name and then do a fuzzy compare. However, this will most likely fail, as e.g. "Tin Taylor" where 'Tin' is misspelled 'Tim' will not lead to 'Timothy Taylor'.

    In order to cover that, you will need a lookup that can 'learn', i.e. is edited by some human.