fuzzy-comparison record-linkage python-dedupe

Use Python dedupe library to return all matches against messy dataset


First, if you haven't seen the Dedupe library for Python: it's awesome. Much like TensorFlow, it's a great way to bring machine learning to the masses (like me).

I'm trying to do record linkage of names against a single, large, messy data set. I'm using heuristics right now, and they're starting to fall short on more complicated data sets.

Questions:

Is there a way to perform a match of a single record (one-by-one or in batches) and return all the potential matches?

The Gazetteer docs say one side must be clean, with no duplicates. If names can be duplicated but serial numbers can't (and serial numbers aren't used in matching), does that count as a duplicate?

Context:

There are 1.6M specialized construction machines in the US. There is a database with the machine type, owner names (up to two, companies included), serial number, and maintenance information like last_service_date.

People often inquire about maintenance and sales of their machines (100-250/day), and I keep a running record. The problem is matching the name given over the phone with the machine(s) that person owns. I need to match the names I have on the forms with the names on the ownership records to learn more about the machine after the fact and understand the lifecycle of the machines.

Sample Data:

"""
 This is simplified data. We often have two names on the form, and owner names
 come in first_name, last_name format but are often split in strange ways when
 multiple owners have a single machine.
"""
# Incoming Record (100-250+ per day)
{
'raw_name': 'Maria C Hernandez', 'inquire_date': '2017-11-16', 'inquire_type': 'sale'
}

# Ownership Records (1.6M+, with duplicates of NAME but not SERIAL #)
[
{'owner_1': 'HECTOR & MARIANNE HERNANDEZ', 'owner_2': '', 'serial': '3993892k'},
{'owner_1': 'MARIANA HERNANDEZ', 'owner_2': '', 'serial': '8383883hh'},
{'owner_1': 'MARIA HERNANDEZ', 'owner_2': 'TAMMY ULMER', 'serial': '123fdfe'},
{'owner_1': 'JOSE & MARIA HERNANDEZ', 'owner_2': 'MH CORP', 'serial': '223466y4'},
{'owner_1': 'MARIA C HERNANDEZ', 'owner_2': 'HIPOLITO HERNANDEZ', 'serial': '2433ff3345'},
]

Maybe I need some guidance, as well... For our heuristics, I essentially split the name fields in both data sets and compare them in 6 or 7 different ways. Now we are getting inquiries with multiple names that could help matching. Maybe more heuristics would work, but this tool seems perfect for the job.
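For reference, the kind of split-and-compare heuristic described above can be sketched roughly as follows. This is an illustration only, not dedupe and not the actual 6 or 7 comparisons in use: the function names, the token-overlap score, and the `min_overlap` cutoff are all assumptions made for the example. It returns every ownership record that clears the cutoff, rather than a single best match.

```python
import re

def name_tokens(raw):
    """Upper-case a name string and split it into a set of word tokens."""
    return set(re.findall(r"[A-Z0-9]+", raw.upper()))

def candidate_matches(raw_name, ownership_records, min_overlap=0.5):
    """Return (serial, score) for every record that shares enough tokens.

    The score is the fraction of query tokens found among the record's
    owner_1 and owner_2 tokens. Hypothetical heuristic for illustration.
    """
    query = name_tokens(raw_name)
    results = []
    for record in ownership_records:
        owners = name_tokens(record["owner_1"]) | name_tokens(record["owner_2"])
        score = len(query & owners) / len(query) if query else 0.0
        if score >= min_overlap:
            results.append((record["serial"], score))
    # Highest-scoring candidates first
    return sorted(results, key=lambda pair: pair[1], reverse=True)

incoming = {'raw_name': 'Maria C Hernandez', 'inquire_date': '2017-11-16', 'inquire_type': 'sale'}
ownership = [
    {'owner_1': 'HECTOR & MARIANNE HERNANDEZ', 'owner_2': '', 'serial': '3993892k'},
    {'owner_1': 'MARIANA HERNANDEZ', 'owner_2': '', 'serial': '8383883hh'},
    {'owner_1': 'MARIA HERNANDEZ', 'owner_2': 'TAMMY ULMER', 'serial': '123fdfe'},
    {'owner_1': 'JOSE & MARIA HERNANDEZ', 'owner_2': 'MH CORP', 'serial': '223466y4'},
    {'owner_1': 'MARIA C HERNANDEZ', 'owner_2': 'HIPOLITO HERNANDEZ', 'serial': '2433ff3345'},
]

matches = candidate_matches(incoming['raw_name'], ownership)
for serial, score in matches:
    print(serial, round(score, 2))
```

On the sample data this keeps the three records that contain both MARIA and HERNANDEZ and drops MARIANNE and MARIANA, which share only the surname. Exact token matching like this is brittle with misspellings and oddly split names, which is where it starts to fall short.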


Solution

  • This is a good use case for the Gazetteer class. What makes you think it is not appropriate?

    (I am the primary author of dedupe)
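A minimal sketch of how the sample data might be shaped for a Gazetteer, assuming dedupe's usual input format of a dict that maps each record id to a dict of fields. The helper names and the combined `owners` field are hypothetical, chosen only for this example. Keyed by serial number, two machines with the same owner name stay separate canonical entries, since each one is a distinct machine.

```python
def to_canonical(ownership_records):
    """Key each ownership record by its unique serial number."""
    canonical = {}
    for record in ownership_records:
        # Join the non-empty owner fields into one string to compare on
        owners = " ".join(
            part for part in (record["owner_1"], record["owner_2"]) if part
        )
        canonical[record["serial"]] = {"owners": owners}
    return canonical

def to_messy(incoming_records):
    """Key incoming inquiries by their position in the batch."""
    return {
        i: {"owners": record["raw_name"].upper()}
        for i, record in enumerate(incoming_records)
    }

ownership = [
    {'owner_1': 'MARIA HERNANDEZ', 'owner_2': 'TAMMY ULMER', 'serial': '123fdfe'},
    {'owner_1': 'MARIA C HERNANDEZ', 'owner_2': 'HIPOLITO HERNANDEZ', 'serial': '2433ff3345'},
]
incoming = [
    {'raw_name': 'Maria C Hernandez', 'inquire_date': '2017-11-16', 'inquire_type': 'sale'},
]

canonical = to_canonical(ownership)  # the 1.6M ownership records
messy = to_messy(incoming)           # the daily batch of inquiries
```

With data in this shape, returning several candidates per inquiry is handled by the `n_matches` argument of the Gazetteer's search method (named `search` in dedupe 2.x). The exact signature and return format are worth verifying against the documentation for the installed version.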