Search code examples
phpmysqlsearchfull-text-searchsearch-engine

Identifying (non exact) duplicates from a MySQL database


Is there any tools for identifying, and merging non exact duplicates in MySQL tables?

I have a large data set with many duplicates like:

1348,  Auto Motors, 12 Long Road, etc
48264, Auto Mtors,  12 Log Road,  etc
82743, Ato Motoers, 12 Lng Road,  etc
83821, Auto Motors, 13 Long Road, etc
92743, Auto Motors, 11 Long Road, etc

There are many tables needed to be merged like:

  • Companies
  • Addresses
  • Phone Numbers
  • Employees

There is about 100,000 rows, and 30-40 columns to match on each row (joined tables).

So, anyone know of a tool for sorting this out? I already have MySQL, PHP installed. I have/can use(d) MongoDB, and Solr before if they would help. And I am open to installing other software if needed.


Alternatively what kind of queries should I run if I cannot find a tool to handle this.

A simple find all duplicates wont work because they are not exact.

Doing wildcard like searches would be extremely slow for all the different combinations I would need to try.

Using a Oliver or Levenshtein (MySQL) may work, and there is too much data to pull into PHP (also probably extremely slow).


Solution

  • You have data that requires massaging. I don't think this is something you can do entirely in sql.

    Google Refine is a great tool for massaging. I would load the data in Refine first, clean it up, then import into your relational database.