Tags: python, pandas, fuzzy-comparison

Fuzzy Match between large number of records


I have two data frames: one contains 33,765 companies, the other 358,839. I want to find matches between them using fuzzy matching. Because the number of records is so high, I am breaking the records of both data frames down by the first letter of the company name. For example, for companies starting with the letter "A", the first data frame has 2,600 records and the second has 25,000. I then perform a full cross join between them and apply fuzzy matching to keep all pairs with a fuzz ratio of 95 or more. This still does not work: the number of record pairs is too high, and the kernel dies every time I run these operations. The same approach worked fine when both frames had record counts in the four digits. Also, please suggest a way to automate this for all letters 'A' to 'Z' instead of manually running the code for each letter (without making the kernel die).

Here's my code:

import pandas as pd
from fuzzywuzzy import fuzz  # assuming fuzz.ratio comes from fuzzywuzzy

c = 'A'

# Keep only companies whose names start with the current letter.
df1 = df1[df1.companyName.str[0] == c].copy()
df2 = df2[df2.companyName.str[0] == c].copy()

# Cross join: pair every df1 row with every df2 row.
df1['Join'] = 1
df2['Join'] = 1
df3 = pd.merge(df1, df2, on='Join')

# Score every pair, keep the best match per df1 company, filter at 95.
df3['Fuzz'] = df3.apply(lambda x: fuzz.ratio(x['companyName_x'], x['companyName_y']), axis=1)
df3.sort_values(['companyName_x', 'Fuzz'], ascending=False, inplace=True)
df4 = df3.groupby('companyName_x', as_index=False).first()
df5 = df4[df4.Fuzz >= 95]

Solution

  • You started down the right path by chunking records based on a shared attribute (the first letter). In the record linkage literature, this concept is called blocking, and it's critical to reducing the number of comparisons to something tractable.

    The way forward is to find even better blocking rules: maybe first five characters, or a whole word in common.

    The dedupe library can help you find good blocking rules. (I'm a core dev for this library)