Search code examples
pythonpandasdataframefuzzywuzzydifflib

Fuzzy-compare two dataframes of addresses and copy info from 1 to another


I have this data set. df1 = 70,000 rows and df2 = ~30 rows. I want to match the address to see if df2 appears in df1 and if it does than I want to show the match and also pull info from df1 to create a new df3. Sometimes the address info is off by a bit..for example (road = rd, street = st, etc )Here's an example:

df1 = 

address                unique key (and more columns)

123 nice road           Uniquekey1
150  spring drive       Uniquekey2
240 happy lane          Uniquekey3
80 sad parkway          Uniquekey4
etc


df2 =

address            (and more columns)

123 nice rd          
150  spring dr      
240 happy lane          
80 sad parkway         
etc

And this is what Id want a new dataframe :

df3=

address(from df2)     addressed matched(from df1)     unique key(comes from df1) (and more columns)      

123 nice rd            123 nice road                    Uniquekey1
150  spring dr         150  spring drive                Uniquekey2
240 happy lane         240 happy lane                   Uniquekey3
 80 sad parkway        80 sad parkway                   Uniquekey4
etc            

Here's what Ive tried so far using difflib:

df1['key'] = df1['address']
df2['key'] = df2['address']

df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'], n=1))

this returns what looks like a list, the answer is in []'s so then I convert the df2['key'] into a string using df2['key'] = df2['key'].apply(str)

then I try to merge using df2.merge(df1, on ='key') and no address is matching?

I'm not sure what it could be but any help would be greatly appreciated. I also am playing around with the fuzzywuzzy package.


Solution

  • My answer is similar to one of your old questions that I answered.

    I slightly modified your dataframe:

    >>> df1
                 address  unique key
    0      123 nice road  Uniquekey1
    1  150  spring drive  Uniquekey2
    2     240 happy lane  Uniquekey3
    3     80 sad parkway  Uniquekey4
    
    >>> df2  # shuffle rows
              address
    0  80 sad parkway
    1  240 happy lane
    2  150  winter dr  # change the season :-)
    3     123 nice rd
    

    Use extractOne function from fuzzywuzzy.process:

    from fuzzywuzzy import process
    
    THRESHOLD = 90
    
    best_match = \
        df2['address'].apply(lambda x: process.extractOne(x, df1['address'],
                                                          score_cutoff=THRESHOLD))
    

    The output of extractOne is:

    >>> best_match
    0    (80 sad parkway, 100, 3)
    1    (240 happy lane, 100, 2)
    2                        None
    3      (123 nice road, 92, 0)
    Name: address, dtype: object
    

    Now you can merge your 2 dataframes:

    df3 = pd.merge(df2, df1.set_index(best_match.apply(pd.Series)[2]),
                   left_index=True, right_index=True, how='left')
    
    >>> df3
            address_x          address_y  unique key
    0  80 sad parkway     80 sad parkway  Uniquekey4
    1  240 happy lane                NaN         NaN
    2  150  winter dr  150  spring drive  Uniquekey2
    3     123 nice rd      123 nice road  Uniquekey1