Search code examples
pythonpandas

Merge two pandas DataFrame based on partial match


Two DataFrames have city names that are not formatted the same way. I'd like to do a Left-outer join and pull geo field for all partial string matches between the field City in both DataFrames.

import pandas as pd

df1 = pd.DataFrame({
    'City': ['San Francisco, CA', 'Oakland, CA'],
    'Val': [1, 2]
})

df2 = pd.DataFrame({
    'City': ['San Francisco-Oakland, CA', 'Salinas, CA'],
    'Geo': ['geo1', 'geo2']
})

Expected DataFrame upon join:

City                   Val   Geo
San Francisco, CA      1     geo1
Oakland, CA            2     geo1

Solution

  • Update: the fuzzywuzzy project has been renamed to thefuzz and moved here

    You can use thefuzz package and the function extractOne:

    # Python env: pip install thefuzz
    # Anaconda env: pip install thefuzz
    # -> thefuzz is not yet available on Anaconda (2021-09-18)
    # -> you can use the old package: conda install -c conda-forge fuzzywuzzy
    
    from thefuzz import process
    
    best_city = lambda x: process.extractOne(x, df2["City"])[2]  # See note below
    df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values
    

    Output:

    >>> df1
                    City  Val   Geo
    0  San Francisco, CA    1  geo1
    1        Oakland, CA    2  geo1
    

    Note: extractOne return a tuple of 3 values from the best match: the City name from df2 [0], the accuracy score [1] and the index [2] (<- the one I use).