I'm looking for a way to merge df. However, I don't know what would be the best way to do this.
first df - metro cities/population/teams
Metropolitan area | Population (2016 est.)[8] | NHL |
---|---|---|
Phoenix | 4661537 | Coyotes |
Los Angeles | 13310447 | Kings Ducks |
Toronto | 5928040 | Maple Leafs |
Boston | 4794447 | Bruins |
Edmonton | 1321426 | Oilers |
New York City | 20153634 | Rangers Islanders Devils |
Second df - team/wins/losses
team | w | L |
---|---|---|
Los Angeles Kings | 46 | 28 |
Phoenix Coyotes | 37 | 30 |
Toronto Maple Leafs | 49 | 26 |
Boston Bruins | 50 | 20 |
Edmonton Oilers | 29 | 44 |
New York Islanders | 34 | 37 |
I tried to merge across teams. However, I need to arrange this data so that it collides in the Merge. I don't know how I would do that without looking at it case by case.
Note: The data set is much larger and with more cities and teams. I had a little trouble presenting the DF here, so I only put 6 rows and the main columns.
You can try to use difflib
from standard python library.
# Python >= 3.8 to use the walrus operator
import difflib
teams = df2['team'].tolist()
# Adjust cutoff parameter to avoid false positive
df1['team'] = [m[0] if (m := difflib.get_close_matches(area, teams, n=1, cutoff=0.3))
else None
for area in df1['Metropolitan area']]
Output
>>> df1
Metropolitan area Population (2016 est.)[8] NHL team
0 Phoenix 4661537 Coyotes Phoenix Coyotes
1 Los Angeles 13310447 Kings Ducks Los Angeles Kings
2 Toronto 5928040 Maple Leafs Toronto Maple Leafs
3 Boston 4794447 Bruins Boston Bruins
4 Edmonton 1321426 Oilers Edmonton Oilers
5 New York City 20153634 Rangers Islanders Devils New York Islanders
Now you can merge your dataframe on team
column. This code is very easily "parallelizable" with multiprocessing if your dataframe is very large.
You can also use thefuzz
module to accomplish the same task. Take a look to https://stackoverflow.com/a/69169135/15239951