I have this data set. df1 = 70,000 rows and df2 = ~30 rows. I want to match the address to see if df2 appears in df1 and if it does than I want to show the match and also pull info from df1 to create a new df3. Sometimes the address info is off by a bit..for example (road = rd, street = st, etc )Here's an example:
df1 =
address unique key (and more columns)
123 nice road Uniquekey1
150 spring drive Uniquekey2
240 happy lane Uniquekey3
80 sad parkway Uniquekey4
etc
df2 =
address (and more columns)
123 nice rd
150 spring dr
240 happy lane
80 sad parkway
etc
And this is what Id want a new dataframe :
df3=
address(from df2) addressed matched(from df1) unique key(comes from df1) (and more columns)
123 nice rd 123 nice road Uniquekey1
150 spring dr 150 spring drive Uniquekey2
240 happy lane 240 happy lane Uniquekey3
80 sad parkway 80 sad parkway Uniquekey4
etc
Here's what Ive tried so far using difflib:
df1['key'] = df1['address']
df2['key'] = df2['address']
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'], n=1))
this returns what looks like a list, the answer is in []'s so then I convert the df2['key'] into a string using df2['key'] = df2['key'].apply(str)
then I try to merge using df2.merge(df1, on ='key') and no address is matching?
I'm not sure what it could be but any help would be greatly appreciated. I also am playing around with the fuzzywuzzy package.
My answer is similar to one of your old questions that I answered.
I slightly modified your dataframe:
>>> df1
address unique key
0 123 nice road Uniquekey1
1 150 spring drive Uniquekey2
2 240 happy lane Uniquekey3
3 80 sad parkway Uniquekey4
>>> df2 # shuffle rows
address
0 80 sad parkway
1 240 happy lane
2 150 winter dr # change the season :-)
3 123 nice rd
Use extractOne
function from fuzzywuzzy.process
:
from fuzzywuzzy import process
THRESHOLD = 90
best_match = \
df2['address'].apply(lambda x: process.extractOne(x, df1['address'],
score_cutoff=THRESHOLD))
The output of extractOne
is:
>>> best_match
0 (80 sad parkway, 100, 3)
1 (240 happy lane, 100, 2)
2 None
3 (123 nice road, 92, 0)
Name: address, dtype: object
Now you can merge your 2 dataframes:
df3 = pd.merge(df2, df1.set_index(best_match.apply(pd.Series)[2]),
left_index=True, right_index=True, how='left')
>>> df3
address_x address_y unique key
0 80 sad parkway 80 sad parkway Uniquekey4
1 240 happy lane NaN NaN
2 150 winter dr 150 spring drive Uniquekey2
3 123 nice rd 123 nice road Uniquekey1