Search code examples
phpc++algorithmemailsimilarity

Similar Data Algorithm


I have a couple DBs of user information, each one 10k-20k entries, each one from a couple of different sources, and each one constantly growing. I'm looking to create a tool that can within a certain tolerance notice similar emails, or similar names ( first + ' ' + last ). I'm running a MySQL database, and can work with either C++ or PHP to run the comparison. Can anyone suggest any existing solutions / tutorials that would allow me to just run a check against the database or an array of data and return possible duplicates? I'd just want it to pick up a few common mistakes like these:

[email protected] <> [email protected] <> [email protected]
Josh O <> josh t O <> Joshua O

Maybe have the tolerance adjustable to within a certain amount of characters difference between the entries? Thanks you very, very much for any advice or solutions, I've not had much success Googling.


Solution

  • That would depend on your notion of "similarity". If you are looking for the number of characters that must be inserted, deleted or replaced in order to transform one string into another, the algorithm is called Levenshtein distance. Be warned, though, that it is quite slow for long strings (as each comparison uses a number of operations that is proportional to mn, where m and n are the lengths of the strings being compared), but if your data is email addresses and other short strings, you should be fine (and your biggest problem would be the number of comparisons, as you would need to compare each pair of strings to each other).