I have a spreadsheet with values like address, name, IBAN and e-mail, and I want to identify when a customer last bought something.
The problem is that some fields contain spelling mistakes, and others were deliberately entered incorrectly.
On GitHub, several libraries such as https://github.com/seatgeek/fuzzywuzzy, https://github.com/seamusabshere/fuzzy_match and https://github.com/atom/fuzzaldrin are available for fuzzy searches on a single, comparable column. But I want to combine multiple fields. This sounds like a common problem, and I expected to find existing solutions out there.
Can you recommend approaches for such a problem? Are there existing projects for it that I am missing? Is a regular string distance over all the fields usually good enough?
I mentioned it in your other question, but the dedupe Python library does what you want.
Basically, it calculates the distance between each field in a pair of rows, then learns optimal weights to combine those distances into a single record-pair score.
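To make the idea concrete, here is a rough sketch of per-field distances combined into one record-pair score. It is not dedupe's API: it uses only the standard library's `difflib`, and the field names, the `WEIGHTS` values and the helper functions are assumptions I made up for illustration. The weights are fixed by hand here, whereas dedupe learns them from pairs you label.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-picked weights per field (assumption for illustration).
# dedupe would learn such weights from labeled record pairs instead.
WEIGHTS = {"name": 0.3, "address": 0.2, "iban": 0.3, "email": 0.2}


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values.

    Case and surrounding whitespace are ignored; an empty value
    on either side counts as no evidence of a match (0.0).
    """
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def record_score(r1, r2, weights=WEIGHTS):
    """Weighted average of the per-field similarities of two records."""
    total = sum(weights.values())
    weighted = sum(
        w * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )
    return weighted / total


# Made-up example rows: b is a misspelled variant of a, c is someone else.
a = {"name": "John Smith", "address": "12 Main Street",
     "iban": "DE89370400440532013000", "email": "john.smith@example.com"}
b = {"name": "Jon Smith", "address": "12 Main St.",
     "iban": "DE89370400440532013000", "email": "john.smith@example.com"}
c = {"name": "Maria Lopez", "address": "7 Oak Avenue",
     "iban": "FR7630006000011234567890189", "email": "m.lopez@example.org"}

print(record_score(a, b))  # misspelled variant of the same customer
print(record_score(a, c))  # unrelated customer
```

You would then treat pairs above some threshold as the same customer. Picking the weights and the threshold by hand is the fragile part, which is why a library that learns them from labeled examples tends to work better than a single string distance over all fields concatenated.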