Search code examples
pythonpandasfuzzy

Perform Fuzzy Matching in 2 pandas dataframe


I have two dataframes with different rows numbers contain information about players. The first has all names that I need.

df1 = pd.DataFrame({'Player': ["John Sepi", 'Zan Fred', 'Mark Daniel', 'Adam Pop', 'Paul Sepi', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'C', 'E', 'C', 'B', 'D', 'B', 'A', 'D']})

The another dataframe is missing some players, but has a column with age. The player's names have smaller differences in some cases.

df2 = pd.DataFrame({'Player': ["John Sepi", 'Mark A. Daniel', 'John Hernandez', 'Price Josiah', 'John Hernandez', 'Adam Pop'], 
                    'Team': ['A', 'E', 'D', 'B', 'A', 'D'],
                   'Age': [22, 21, 26, 18, 19, 25]})

The equals names are different persons, because of that i need match at the same time Player and Team. I want to create a new dataframe with all names from first dataframe with respective age from second dataframe. In case of missing players in second, complete new dataframe with constant value(like XX years, can be any age..just to illustrate). The final dataframe:

print(final_df)
           Player Team  Age
0       John Sepi    A   22
1        Zan Fred    C   XX
2     Mark Daniel    E   21
3        Adam Pop    C   XX
4       Paul Sepi    B   XX
5  John Hernandez    D   26
6    Price Josiah    B   18
7  John Hernandez    A   19
8        Adam Pop    D   25


Solution

  • You can use the text matching capabilities of the fuzzywuzzy library mixed with pandas functions in python.

    First, import the following libraries :

    import pandas as pd
    import numpy as np
    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process
    

    You can use the text matching capabilities of the fuzzywuzzy python library :

    #get list of unique teams existing in df1
    lst_teams = list(np.unique(np.array(df1['Team'])))
    #define arbitrary threshold
    thres = 70
    #for each team match similar texts
    for team in lst_teams:
        #iterration on dataframe filtered by team
        for index, row in df1.loc[df1['Team']==team].iterrows():
            #get list of players in this team
            lst_player_per_team = list(np.array(df2.loc[df2['Team']==team]['Player']))
            #use of fuzzywuzzy to make text matching
            output_ratio = process.extract(row['Player'], lst_player_per_team, scorer=fuzz.token_sort_ratio)
            #check if there is players from df2 in this team
            if output_ratio !=[]:
                #put arbitrary threshold to get most similar text
                if output_ratio[0][1]>thres:
                    df1.loc[index, 'Age'] = df2.loc[(df2['Team']==team)&(df2['Player']==output_ratio[0][0])]['Age'].values[0]
    df1 = df1.fillna('XX')
    

    with this code and a threshold defined as 70, you get the following result:

    print(df1)
               Player Team Age
    0       John Sepi    A  22
    1        Zan Fred    C  XX
    2     Mark Daniel    E  21
    3        Adam Pop    C  XX
    4       Paul Sepi    B  XX
    5  John Hernandez    D  26
    6    Price Josiah    B  18
    7  John Hernandez    A  19
    8        Adam Pop    D  25
    

    It is possible to move the threshold to increase the accuracy of the text matching capabilities between the two dataframes.

    Please note that you should be careful when using .iterrows() as iteration on a dataframe is not advised.

    You can check the fuzzywuzzy doc here https://pypi.org/project/fuzzywuzzy/