Two DataFrames have city names that are not formatted the same way. I'd like to do a Left-outer join and pull geo
field for all partial string matches between the field City
in both DataFrames.
import pandas as pd
df1 = pd.DataFrame({
'City': ['San Francisco, CA', 'Oakland, CA'],
'Val': [1, 2]
})
df2 = pd.DataFrame({
'City': ['San Francisco-Oakland, CA', 'Salinas, CA'],
'Geo': ['geo1', 'geo2']
})
Expected DataFrame upon join:
City Val Geo
San Francisco, CA 1 geo1
Oakland, CA 2 geo1
Update: the fuzzywuzzy
project has been renamed to thefuzz
and moved here
You can use thefuzz
package and the function extractOne
:
# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy
from thefuzz import process
best_city = lambda x: process.extractOne(x, df2["City"])[2] # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values
Output:
>>> df1
City Val Geo
0 San Francisco, CA 1 geo1
1 Oakland, CA 2 geo1
Note: extractOne
return a tuple of 3 values from the best match: the City name from df2
[0], the accuracy score [1] and the index [2] (<- the one I use).