Thinking about the best way to merge two DataFrame

I'm looking for a way to merge df. However, I don't know what would be the best way to do this.

first df - metro cities/population/teams

Metropolitan area	Population (2016 est.)[8]	NHL
Phoenix	4661537	Coyotes
Los Angeles	13310447	Kings Ducks
Toronto	5928040	Maple Leafs
Boston	4794447	Bruins
Edmonton	1321426	Oilers
New York City	20153634	Rangers Islanders Devils

Second df - team/wins/losses

team	w	L
Los Angeles Kings	46	28
Phoenix Coyotes	37	30
Toronto Maple Leafs	49	26
Boston Bruins	50	20
Edmonton Oilers	29	44
New York Islanders	34	37

I tried to merge across teams. However, I need to arrange this data so that it collides in the Merge. I don't know how I would do that without looking at it case by case.

Note: The data set is much larger and with more cities and teams. I had a little trouble presenting the DF here, so I only put 6 rows and the main columns.

Solution

You can try to use difflib from standard python library.

# Python >= 3.8 to use the walrus operator
import difflib

teams = df2['team'].tolist()

# Adjust cutoff parameter to avoid false positive
df1['team'] = [m[0] if (m := difflib.get_close_matches(area, teams, n=1, cutoff=0.3))
                    else None
                    for area in df1['Metropolitan area']]

Output

>>> df1
  Metropolitan area  Population (2016 est.)[8]                       NHL                 team
0           Phoenix                    4661537                   Coyotes      Phoenix Coyotes
1       Los Angeles                   13310447               Kings Ducks    Los Angeles Kings
2           Toronto                    5928040               Maple Leafs  Toronto Maple Leafs
3            Boston                    4794447                    Bruins        Boston Bruins
4          Edmonton                    1321426                    Oilers      Edmonton Oilers
5     New York City                   20153634  Rangers Islanders Devils   New York Islanders

Now you can merge your dataframe on team column. This code is very easily "parallelizable" with multiprocessing if your dataframe is very large.

You can also use thefuzz module to accomplish the same task. Take a look to https://stackoverflow.com/a/69169135/15239951