python duplicates fuzzywuzzy record-linkage python-dedupe

Python Record Linkage, Fuzzy Match and Deduplication


I have 3 customer datasets, each with 7 columns:

CustomerName

Address

Phone

StoreName

Mobile

Longitude

Latitude

Each dataset has 13,000 to 18,000 records. I am trying to fuzzy match across them for deduplication. The columns should not all carry the same weight in the matching. How can I handle this? Is there a good library for my case?


Solution

  • I think the recordlinkage library (the Python Record Linkage Toolkit) would suit your purposes.

    You can use its Compare object to request a different kind of match for each column:

    import recordlinkage
    compare_cl = recordlinkage.Compare()  # collects the per-column comparisons
    compare_cl.exact('CustomerName', 'CustomerName', label='CustomerName')
    compare_cl.string('StoreName', 'StoreName', method='jarowinkler', threshold=0.85, label='StoreName')
    compare_cl.string('Address', 'Address', threshold=0.85, label='Address')
    

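    Then decide how strict a match has to be. For example, to require that at least 2 of the 3 features agree:

    # pairs: candidate record pairs (a pandas MultiIndex), e.g. built with recordlinkage.Index()
    features = compare_cl.compute(pairs, df)
    # each feature is 0 or 1, so the row sum counts how many columns agree
    matches = features[features.sum(axis=1) >= 2]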
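
    The row sum above treats every column equally, while the question asks for columns with different weights. One way to get that is to replace the plain sum with a weighted one. The following is only a minimal sketch of the idea: it uses the standard library's difflib instead of recordlinkage so that it runs on its own, and the weights, the threshold, and the sample records are all made-up assumptions, not values from the question.

```python
from difflib import SequenceMatcher

# Hypothetical per-column weights (assumed values; tune them for your data).
WEIGHTS = {"CustomerName": 0.4, "Address": 0.3, "StoreName": 0.2, "Phone": 0.1}

# Assumed cut-off for calling two records duplicates.
THRESHOLD = 0.85


def similarity(a, b):
    """Return a 0..1 similarity ratio for two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def weighted_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of the per-column similarities of two records (dicts)."""
    total = sum(weights.values())
    return sum(w * similarity(rec_a[col], rec_b[col]) for col, w in weights.items()) / total


# Made-up sample records, for illustration only.
a = {"CustomerName": "John Smith", "Address": "12 Main Street",
     "StoreName": "Smith Market", "Phone": "5551234"}
b = {"CustomerName": "Jon Smith", "Address": "12 Main St.",
     "StoreName": "Smith Market", "Phone": "5551234"}
c = {"CustomerName": "Maria Lopez", "Address": "98 Ocean Avenue",
     "StoreName": "Lopez Bakery", "Phone": "5559876"}

is_dup_ab = weighted_score(a, b) >= THRESHOLD  # near-identical records
is_dup_ac = weighted_score(a, c) >= THRESHOLD  # unrelated records
```

    With the features frame from the answer, the same idea would presumably be a weighted row sum, e.g. multiplying each feature column by its weight (pandas DataFrame.mul with a Series of weights indexed by the feature labels) before calling sum(axis=1), and then comparing against a threshold you choose.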