Tags: algorithm, duplicates, fuzzy-logic, record-linkage

Data deduplication algorithm for a large number of contacts


I'm developing an application that must be able to find and merge duplicates among hundreds of thousands of contact records stored in a SQL Server database. I have to compare all the columns in the table, and each column has a weight value; the comparison must take these weights into account. Based on the comparison result and the degree of equivalence, I have to decide whether to merge the contacts automatically or flag them for user review. I know there are a number of fuzzy-logic algorithms for deduplication.
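To make the weighted comparison concrete, here is a minimal sketch of what I have in mind. The column names, weights, thresholds, and the use of a simple string ratio are all illustrative assumptions, not a tuned implementation:

```python
from difflib import SequenceMatcher

# Hypothetical per-column weights -- adjust to the real schema.
WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "street": 0.2, "city": 0.1}

def column_similarity(a: str, b: str) -> float:
    """Simple string similarity in [0, 1]; any fuzzy measure could be swapped in."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def contact_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-column similarities."""
    total = sum(WEIGHTS.values())
    score = sum(w * column_similarity(rec_a.get(col, ""), rec_b.get(col, ""))
                for col, w in WEIGHTS.items())
    return score / total

def decide(score: float, auto_merge: float = 0.92, needs_review: float = 0.75) -> str:
    """Thresholds are illustrative, not tuned."""
    if score >= auto_merge:
        return "merge"
    if score >= needs_review:
        return "review"
    return "distinct"

a = {"first_name": "Bob", "last_name": "Smith", "street": "123 Main st", "city": "Austin"}
b = {"first_name": "Robert", "last_name": "Smyth", "street": "123 Main street", "city": "Austin"}
print(decide(contact_score(a, b)))
```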

I read about N-gram / Q-gram-based algorithms at http://www.melissadata.com/. Is this kind of algorithm feasible for a large data set? If not, can anyone suggest an algorithm or tell me where to start?
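For reference, a q-gram comparison itself is cheap; here is a small sketch of trigram Jaccard similarity (padding characters and q = 3 are assumptions):

```python
def qgrams(s: str, q: int = 3) -> set:
    """Split a padded, lower-cased string into overlapping q-grams (trigrams by default)."""
    s = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 3) -> float:
    """Jaccard similarity over the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

print(qgram_similarity("Gonzales", "Gonzalez"))   # high: one-character difference
print(qgram_similarity("Gonzales", "Robertson"))  # low
```

My concern with a large data set is less the per-pair cost and more the O(n²) number of pairs, which presumably needs some blocking/indexing step (e.g. only comparing records that share a q-gram or a phonetic key) rather than comparing every record with every other.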

Examples of what I want to achieve (handled in the normalization sketch below):

Gonzales = Gonzalez (two different spellings of the same name)
Smith = Smyth (phonetically the same)
123 Main st = 123 Main street (abbreviation)
Bob Smith = Robert Smith (nickname/synonym)
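The phonetic, abbreviation, and nickname cases above can be approximated by normalizing before comparison. A sketch follows; the Soundex implementation is the simplified textbook version, and the abbreviation/nickname tables are tiny illustrative samples, not complete lists:

```python
SOUNDEX_MAP = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
               **dict.fromkeys("dt", "3"), "l": "4",
               **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(name: str) -> str:
    """Classic 4-character Soundex code (simplified: no h/w bridging rule)."""
    name = name.lower()
    code = name[0].upper()
    prev = SOUNDEX_MAP.get(name[0], "")
    for ch in name[1:]:
        digit = SOUNDEX_MAP.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "000")[:4]

STREET_ABBREV = {"st": "street", "ave": "avenue", "rd": "road"}        # illustrative only
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}   # illustrative only

def normalize_address(addr: str) -> str:
    return " ".join(STREET_ABBREV.get(tok, tok)
                    for tok in addr.lower().replace(".", "").split())

def normalize_first_name(name: str) -> str:
    return NICKNAMES.get(name.lower(), name.lower())

print(soundex("Smith") == soundex("Smyth"))                                       # True
print(normalize_address("123 Main st") == normalize_address("123 Main street"))  # True
print(normalize_first_name("Bob") == normalize_first_name("Robert"))             # True
```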

Solution

  • Found a partial solution using the SimHash algorithm. Found a good example here: http://simhash.codeplex.com/
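The linked example is C#; for anyone skimming, here is a minimal Python sketch of the SimHash idea (word tokens with equal weights, a 64-bit fingerprint, and MD5 as the per-token hash are all assumptions). Records whose fingerprints differ in only a few bits become candidate duplicates; bucketing fingerprints for fast Hamming-distance lookup is left out:

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Build a SimHash fingerprint: sum +1/-1 per bit over all token hashes."""
    v = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("Robert Smith 123 Main Street Austin")
b = simhash("Bob Smith 123 Main St Austin")
print(hamming_distance(a, b))  # a small distance suggests a candidate duplicate pair
```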