Search code examples
pythonpandasdataframedata-cleaning

Thinking about the best way to merge two DataFrame


I'm looking for a way to merge df. However, I don't know what would be the best way to do this.

first df - metro cities/population/teams

Metropolitan area Population (2016 est.)[8] NHL
Phoenix 4661537 Coyotes
Los Angeles 13310447 Kings Ducks
Toronto 5928040 Maple Leafs
Boston 4794447 Bruins
Edmonton 1321426 Oilers
New York City 20153634 Rangers Islanders Devils

Second df - team/wins/losses

team w L
Los Angeles Kings 46 28
Phoenix Coyotes 37 30
Toronto Maple Leafs 49 26
Boston Bruins 50 20
Edmonton Oilers 29 44
New York Islanders 34 37

I tried to merge across teams. However, I need to arrange this data so that it collides in the Merge. I don't know how I would do that without looking at it case by case.

Note: The data set is much larger and with more cities and teams. I had a little trouble presenting the DF here, so I only put 6 rows and the main columns.


Solution

  • You can try to use difflib from standard python library.

    # Python >= 3.8 to use the walrus operator
    import difflib
    
    teams = df2['team'].tolist()
    
    # Adjust cutoff parameter to avoid false positive
    df1['team'] = [m[0] if (m := difflib.get_close_matches(area, teams, n=1, cutoff=0.3))
                        else None
                        for area in df1['Metropolitan area']]
    

    Output

    >>> df1
      Metropolitan area  Population (2016 est.)[8]                       NHL                 team
    0           Phoenix                    4661537                   Coyotes      Phoenix Coyotes
    1       Los Angeles                   13310447               Kings Ducks    Los Angeles Kings
    2           Toronto                    5928040               Maple Leafs  Toronto Maple Leafs
    3            Boston                    4794447                    Bruins        Boston Bruins
    4          Edmonton                    1321426                    Oilers      Edmonton Oilers
    5     New York City                   20153634  Rangers Islanders Devils   New York Islanders
    

    Now you can merge your dataframe on team column. This code is very easily "parallelizable" with multiprocessing if your dataframe is very large.

    You can also use thefuzz module to accomplish the same task. Take a look to https://stackoverflow.com/a/69169135/15239951