I have lists of dictionaries describing sports games with various attributes. Each list contains only one sport, so I have a basketball games list, a baseball games list, etc.
I want to format all of the "game" values the same way, so that every entry for the same matchup (Pit Steelers vs LA Rams, for example) has the same string value. A particular game could be written as "PIT Steelers vs LA Rams", but it could also be formatted as "Pittsburgh Steelers vs Los Angeles Rams". I may have up to 7 dictionaries within the list for a particular game, each formatted in a slightly different way.
I can't just use the team name or the city alone, because within the same sport the same matchup of team names or cities could occur in two different leagues, such as the NFL and the NCAA.
I was thinking I would use the most expansive game name as the key. For example, I would use "Pittsburgh Steelers vs Los Angeles Rams" instead of "PIT Steelers vs LA Rams" as the baseline key.
Is there a way to compare the other game strings to the key and say: if more than X percent of this string matches the key, replace this game string with the key? How would you do it? I am open to all suggestions!
Thanks!
Edit: Here is an attempt using difflib. I generated 1000 random games, imported them into Excel, and sorted by ratio. The results show that it isn't a perfect fit.
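The original attempt is not reproduced here. As a rough sketch of what a difflib-based comparison might look like (the helper names `similarity` and `replace_if_similar` and the 0.6 threshold are made up for illustration, not taken from the attempt itself):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def replace_if_similar(title, key, threshold=0.6):
    """Return key if title is at least `threshold` similar to it, else title.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return key if similarity(title, key) >= threshold else title

key = "Pittsburgh Steelers vs Los Angeles Rams"
candidates = [
    "PIT Steelers vs LA Rams",                      # same game, abbreviated
    "Pittsburgh Steelers vs Los Angeles Chargers",  # different game
]
for title in candidates:
    print(f"{similarity(title, key):.2f}  {title}")
```

This also hints at why the fit is imperfect: the second candidate is a different game, yet it scores higher than the abbreviated title for the correct game, because it shares a much longer run of characters with the key.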
Create a list of short team names that don't include the city, then scan each title for those short names. You should find two short team names in each title, which you can then use to group the titles into unique games.
team_long_names = ['Arizona Cardinals', 'Atlanta Falcons', 'Carolina Panthers', 'Chicago Bears',
                   'Dallas Cowboys', 'Detroit Lions', 'Green Bay Packers', 'Los Angeles Rams',
                   'Minnesota Vikings', 'New Orleans Saints', 'New York Giants', 'Philadelphia Eagles',
                   'San Francisco 49ers', 'Seattle Seahawks', 'Washington Redskins', 'Baltimore Ravens',
                   'Buffalo Bills', 'Cincinnati Bengals', 'Cleveland Browns', 'Denver Broncos',
                   'Houston Texans', 'Indianapolis Colts', 'Jacksonville Jaguars', 'Kansas City Chiefs',
                   'Las Vegas Raiders', 'Los Angeles Chargers', 'Miami Dolphins', 'New England Patriots',
                   'New York Jets', 'Pittsburgh Steelers', 'Tennessee Titans']
team_short_names = [n.lower().split(' ')[-1] for n in team_long_names]
game_titles = ['Atlanta Falcons vs New York Jets', 'ATL Falcons vs NY Jets', 'Falcons v Jets',
'SF 49ers vs PIT Steelers', 'San Fransico 49ers vs Pittsburg Steelers', '49ers vs Steelers',
'Dallas Cowboys vs LA Chargers', 'DAL Cowboys vs Los Angles Chargers', 'Cowboys v Chargers',
'Blah blah Falcons and Foo bar Jets']
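titles_by_key = []
for title in game_titles:
    # Keep only the words that are known short team names, in title order.
    game_key = '-'.join([word for word in title.lower().split(' ') if word in team_short_names])
    titles_by_key.append(game_key + ": " + title)
# Sorting places titles that share a key next to each other.
print(sorted(titles_by_key))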
Output:
['49ers-steelers: 49ers vs Steelers',
'49ers-steelers: SF 49ers vs PIT Steelers',
'49ers-steelers: San Fransico 49ers vs Pittsburg Steelers',
'cowboys-chargers: Cowboys v Chargers',
'cowboys-chargers: DAL Cowboys vs Los Angles Chargers',
'cowboys-chargers: Dallas Cowboys vs LA Chargers',
'falcons-jets: ATL Falcons vs NY Jets',
'falcons-jets: Atlanta Falcons vs New York Jets',
'falcons-jets: Blah blah Falcons and Foo bar Jets',
'falcons-jets: Falcons v Jets']
That doesn't solve the problem of possible team name collisions with different leagues, but I suspect there might be easier strategies for detecting the league as a pre-processing step.
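Building on that, here is a minimal sketch of how the grouping could be combined with the "most expansive name" idea from the question. It makes several assumptions: each dictionary has a "game" value plus a hypothetical "league" field filled in by some earlier step, the short-name set is a small stand-in for the full list, and the team names are sorted so that "Jets vs Falcons" and "Falcons vs Jets" share a key.

```python
from collections import defaultdict

# Stand-in for the full set of short team names (illustrative subset).
TEAM_SHORT_NAMES = {"falcons", "jets", "49ers", "steelers"}

def game_key(title, league=""):
    """Return an order-independent key like 'nfl:falcons-jets', or None
    if the title does not contain exactly two known team names."""
    teams = sorted({w for w in title.lower().split() if w in TEAM_SHORT_NAMES})
    if len(teams) != 2:
        return None
    return f"{league.lower()}:{'-'.join(teams)}"

def normalize_games(games):
    """Set each dict's 'game' to the longest title that shares its key."""
    groups = defaultdict(list)
    for g in games:
        key = game_key(g["game"], g.get("league", ""))
        if key is not None:  # leave unrecognized titles untouched
            groups[key].append(g)
    for members in groups.values():
        canonical = max((m["game"] for m in members), key=len)
        for m in members:
            m["game"] = canonical
    return games

games = [
    {"game": "Atlanta Falcons vs New York Jets", "league": "NFL"},
    {"game": "NY Jets vs ATL Falcons", "league": "NFL"},
    {"game": "Falcons v Jets", "league": "NFL"},
    {"game": "49ers vs Steelers", "league": "NFL"},
    {"game": "SF 49ers vs PIT Steelers", "league": "NFL"},
]
normalize_games(games)
print([g["game"] for g in games])
```

Using the longest title as the canonical one is only a heuristic. A misspelled long form could win over a correct shorter one, so a lookup from key to a hand-checked title may be safer where accuracy matters.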