Search code examples
pythonpandasjoinmergesubstring

How do I merge two dataframes by partial string match?


I have two dataframes of Premier League soccer players:

df1:

ID      Player              Team             Pos
1       Gabriel Dos Santos  Arsenal          DF
218     Conor Gallagher     Crystal Palace   MF
396     Gabriel Jesus       Manchester City  FW

df2:

ID  name                            team     minutes
15  Gabriel dos Santos Magalhães    Arsenal  3063
18  Gabriel Martinelli Silva        Arsenal  1855
27  Gabriel Fernando de Jesus       Arsenal  1871

I want to merge the dataframes by name/player and keep ALL rows and columns of d1 and d2 even if a name is not in both dataframes. It would look something like this:

ID  name                            team     minutes  ID   Pos  Team
15  Gabriel dos Santos Magalhães    Arsenal  3063     1    DF   Arsenal
18  Gabriel Martinelli Silva        Arsenal  1855     NA   NA   NA
27  Gabriel Fernando de Jesus       Arsenal  1871     396  FW   Manchester City
NA  Conor Gallagher                 NA       NA       218  MF   Crystal Palace

The only problem is that the names in d1 do not exactly match the names in d2 (think of d1 as a partial name, or substring of d2 names), and some of the names in d1 are not in d2 (and viceversa).

I did this:

d2[d2['name'].apply(lambda player: d1['Player'].str.contains(player)).any(1)]

But it's not working. What should I do?


Solution

  • You can use the package fuzzywuzzy to do fuzzy matching.

    import pandas as pd
    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process
    
    df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
    df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
    
    def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
        """
        :param df_1: the left table to join
        :param df_2: the right table to join
        :param key1: key column of the left table
        :param key2: key column of the right table
        :param threshold: how close the matches should be to return a match, based on Levenshtein distance
        :param limit: the amount of matches that will get returned, these are sorted high to low
        :return: dataframe with boths keys and matches
        """
        s = df_2[key2].tolist()
        
        m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
        df_1['matches'] = m
        
        m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
        df_1['matches'] = m2
        
        return df_1
    
    fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
    

    Result:

              Key       matches
    0       Apple          Aple
    1      Banana      Bannanna
    2      Orange          Orag
    3  Strawberry  Straw, Berry
    

    See this SO for more information.