Is there any tools for identifying, and merging non exact duplicates in MySQL tables?
I have a large data set with many duplicates like:
1348, Auto Motors, 12 Long Road, etc
48264, Auto Mtors, 12 Log Road, etc
82743, Ato Motoers, 12 Lng Road, etc
83821, Auto Motors, 13 Long Road, etc
92743, Auto Motors, 11 Long Road, etc
There are many tables needed to be merged like:
There is about 100,000 rows, and 30-40 columns to match on each row (joined tables).
So, anyone know of a tool for sorting this out? I already have MySQL, PHP installed. I have/can use(d) MongoDB, and Solr before if they would help. And I am open to installing other software if needed.
Alternatively what kind of queries should I run if I cannot find a tool to handle this.
A simple find all duplicates
wont work because they are not exact.
Doing wildcard like searches would be extremely slow for all the different combinations I would need to try.
Using a Oliver
or Levenshtein
(MySQL) may work, and there is too much data to pull into PHP (also probably extremely slow).
You have data that requires massaging. I don't think this is something you can do entirely in sql.
Google Refine is a great tool for massaging. I would load the data in Refine first, clean it up, then import into your relational database.