algorithm cluster-analysis spell-checking levenshtein-distance

Algorithm for clustering names


I have people's names (first name, last name and surname) in a db column. The data is incomplete; for example, some rows

  • have only first name, last name or surname.
  • are in a different order (surname, last name)
  • are incorrectly spelled

I need an algorithm that displays rows in groups, where each group suggests that its rows refer to the same person, so that I can manually delete all but one of them.

This data is very specific and the names are NOT repeated, so if we have John, Jonh Smihtm and John Smith, they are certainly the same person, and I will manually delete all except the last one.

I need to display them in likelihood groups: one group of rows that are very likely the same person (John Smith, Jonh Smit), one of rows that are likely the same person (John, Johnny), and one of rows that may be the same person (Jo, Jonathan).

I am relatively new to data mining and clustering, so please suggest some algorithms and where to get started.


Solution

  • Do not use clustering. It will produce a lot of false positives. It will consider “Sam” and “Pam” highly similar.

    Instead look at spelling correction, or define a Levenshtein distance threshold. But something that models typo behavior will work even better than such a naive letter-based approach.
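
As a rough illustration of the distance-threshold idea, here is a minimal Python sketch. It is not the answerer's method. It assumes the optimal string alignment variant of edit distance (Levenshtein plus adjacent transpositions, so a swap like "Jonh" counts as one typo), and the helper names `osa_distance`, `normalize` and `tier`, as well as the 0.25 and 0.5 cut-offs, are hypothetical choices that would need tuning on real data.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def normalize(name: str) -> str:
    # Lowercase and sort the parts so "Smith John" matches "John Smith".
    return " ".join(sorted(name.lower().split()))


def tier(a: str, b: str) -> str:
    # Distance relative to the longer name; thresholds are assumptions.
    x, y = normalize(a), normalize(b)
    ratio = osa_distance(x, y) / max(len(x), len(y), 1)
    if ratio <= 0.25:
        return "very likely"
    if ratio <= 0.5:
        return "likely"
    return "unlikely"
```

With these assumed cut-offs, `tier("John Smith", "Jonh Smit")` falls in the "very likely" group and `tier("John", "Johnny")` in the "likely" group. Note that "Sam" and "Pam" are still only one edit apart, so a plain distance threshold shares some of the false-positive risk described above; a model of real typo behavior would have to handle such cases.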