python pandas dataframe fuzzywuzzy difflib

Fuzzy-compare two dataframes of addresses and copy info from 1 to another

I have this data set. df1 = 70,000 rows and df2 = ~30 rows. I want to match the address to see if df2 appears in df1 and if it does than I want to show the match and also pull info from df1 to create a new df3. Sometimes the address info is off by a bit..for example (road = rd, street = st, etc )Here's an example:

df1 = 

address                unique key (and more columns)

123 nice road           Uniquekey1
150  spring drive       Uniquekey2
240 happy lane          Uniquekey3
80 sad parkway          Uniquekey4
etc


df2 =

address            (and more columns)

123 nice rd          
150  spring dr      
240 happy lane          
80 sad parkway         
etc

And this is what Id want a new dataframe :

df3=

address(from df2)     addressed matched(from df1)     unique key(comes from df1) (and more columns)      

123 nice rd            123 nice road                    Uniquekey1
150  spring dr         150  spring drive                Uniquekey2
240 happy lane         240 happy lane                   Uniquekey3
 80 sad parkway        80 sad parkway                   Uniquekey4
etc

Here's what Ive tried so far using difflib:

df1['key'] = df1['address']
df2['key'] = df2['address']

df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'], n=1))

this returns what looks like a list, the answer is in []'s so then I convert the df2['key'] into a string using df2['key'] = df2['key'].apply(str)

then I try to merge using df2.merge(df1, on ='key') and no address is matching?

I'm not sure what it could be but any help would be greatly appreciated. I also am playing around with the fuzzywuzzy package.

Solution

My answer is similar to one of your old questions that I answered.

I slightly modified your dataframe:

>>> df1
             address  unique key
0      123 nice road  Uniquekey1
1  150  spring drive  Uniquekey2
2     240 happy lane  Uniquekey3
3     80 sad parkway  Uniquekey4

>>> df2  # shuffle rows
          address
0  80 sad parkway
1  240 happy lane
2  150  winter dr  # change the season :-)
3     123 nice rd

Use extractOne function from fuzzywuzzy.process:

from fuzzywuzzy import process

THRESHOLD = 90

best_match = \
    df2['address'].apply(lambda x: process.extractOne(x, df1['address'],
                                                      score_cutoff=THRESHOLD))

The output of extractOne is:

>>> best_match
0    (80 sad parkway, 100, 3)
1    (240 happy lane, 100, 2)
2                        None
3      (123 nice road, 92, 0)
Name: address, dtype: object

Now you can merge your 2 dataframes:

df3 = pd.merge(df2, df1.set_index(best_match.apply(pd.Series)[2]),
               left_index=True, right_index=True, how='left')

>>> df3
        address_x          address_y  unique key
0  80 sad parkway     80 sad parkway  Uniquekey4
1  240 happy lane                NaN         NaN
2  150  winter dr  150  spring drive  Uniquekey2
3     123 nice rd      123 nice road  Uniquekey1