I have two dataframes of Premier League soccer players:
df1:
ID Player Team Pos
1 Gabriel Dos Santos Arsenal DF
218 Conor Gallagher Crystal Palace MF
396 Gabriel Jesus Manchester City FW
df2:
ID name team minutes
15 Gabriel dos Santos Magalhães Arsenal 3063
18 Gabriel Martinelli Silva Arsenal 1855
27 Gabriel Fernando de Jesus Arsenal 1871
I want to merge the dataframes by name/player and keep ALL rows and columns of d1 and d2 even if a name is not in both dataframes. It would look something like this:
ID name team minutes ID Pos Team
15 Gabriel dos Santos Magalhães Arsenal 3063 1 DF Arsenal
18 Gabriel Martinelli Silva Arsenal 1855 NA NA NA
27 Gabriel Fernando de Jesus Arsenal 1871 396 FW Manchester City
NA Conor Gallagher NA NA 218 MF Crystal Palace
The only problem is that the names in d1 do not exactly match the names in d2 (think of d1 as a partial name, or substring of d2 names), and some of the names in d1 are not in d2 (and viceversa).
I did this:
d2[d2['name'].apply(lambda player: d1['Player'].str.contains(player)).any(1)]
But it's not working. What should I do?
You can use the package fuzzywuzzy
to do fuzzy matching.
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
:param df_1: the left table to join
:param df_2: the right table to join
:param key1: key column of the left table
:param key2: key column of the right table
:param threshold: how close the matches should be to return a match, based on Levenshtein distance
:param limit: the amount of matches that will get returned, these are sorted high to low
:return: dataframe with boths keys and matches
"""
s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m
m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
df_1['matches'] = m2
return df_1
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Result:
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
See this SO for more information.