I have 2 lists of potentially overlapping movie titles, but possibly written in a different form.
They are in 2 different dataframes from pandas. So I have tried to use the map()
function with the fuzzywuzzy
library like so:
df1.title.map(lambda x: process.extractOne(x, choices=df2.title, score_cutoff=95))
Which gives some good quality results. But the time complexity is such that I can only run it on a very small subset of both data frames. When I try incrementing the size of the data frames then it rapidly becomes unusable.
Then I tried to replace fuzzywuzzy
with difflib
. And it is much faster. But I can't get the results I want.
At first I tried:
df1.title.map(lambda x: difflib.get_close_matches(x, df2.title, n=1)
And that was fast but the quality of the results was poor. Even missing some simple uppercase / lowercase changes. Playing with the cutoff
did not help.
So I thought I was using the wrong tool. In the docs and examples I saw get_close_matches
used on single words. In titles there are various words.
Is SequenceMatcher
a better choice?
And if yes, then how do I fit it into the map()
so it does the same as the aforementioned functions: return only the best result, and only if the result is above a certain ratio?
I have written a Python package which aims to solve this problem. Amongst other things, it addresses the n^2 complexity of the problem (e.g. with two datasets of length 100, your code needs 10,000 comparisons).
You can install it using pip install fuzzymatcher
You can find the repo here and docs here.
Basic usage:
Given two dataframes df_left
and df_right
, which you want to fuzzy join, you can write the following:
from fuzzymatcher import link_table, left join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)
Or if you just want to link on the closest match:
fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)