I'm scraping multiple sports betting websites in order to compare the odds for each match across the websites.
My question is how to identify match_id
from a match that already exists in the DB but has team names written in a different way.
Please feel free to add any approaches even if they don't use dataframes or SQLite.
The columns for matches
table are:
match_id: int, sport: string, home_team: string, away_team: string, date: string (dd/mm/YYY)
So for each new match I want to verify if it already exists in the DB.
New match = (sport_to_check, home_team_to_check, away_team_to_check, date_to_check)
My pseudo-code is like:
SELECT match_id FROM matches
WHERE sport = (sport_to_check)
AND date = (date_to_check)
AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(away_team, away_team_to_check) > 80) //the fuzzy scores evaluation
If no match is found the new row would be inserted.
I believe there's no way to mix python with SQL like that so that's why I refer to it as "pseudo-code".
I can also pull the matches table into a Pandas dataframe and do the evaluation with it, if that works (how?...).
At any given time it isn't expected for matches table to have above a couple of thousand records.
Let me give you some examples of expected outputs. Where the solution is represented by "find(row)"
Having matches table in DB as:
+----------+------------+-----------------------------+----------------------+------------+
| match_id | sport | home_team | visitor_team | date |
+----------+------------+-----------------------------+----------------------+------------+
| 84 | football | confianca | cuiaba esporte clube | 24/11/2020 |
| 209 | football | cs alagoana | operario pr | 24/11/2020 |
| 184 | football | grenoble foot 38 | as nancy lorraine | 24/11/2020 |
| 7 | football | sv turkgucu-ataspor munchen | saarbrucken | 24/11/2020 |
| 414 | handball | dinamo bucareste | usam nimes | 24/11/2020 |
| 846 | handball | benidorm | naturhouse la rioja | 25/11/2020 |
| 874 | handball | cegledi | ferencvarosi tc | 25/11/2020 |
| 418 | handball | lemvig-thyboron | kif kolding | 25/11/2020 |
| 740 | ice hockey | tps | kookoo | 25/11/2020 |
| 385 | football | stevenage | hull | 29/11/2020 |
+----------+------------+-----------------------------+----------------------+------------+
And new matches to evaluate:
+----------------+------------+---------------------+---------------------+------------+
| row (for demo) | sport | home_team | visitor_team | date |
+----------------+------------+---------------------+---------------------+------------+
| A | football | confianca-se | cuiaba mt | 24/11/2020 |
| B | football | csa | operario | 24/11/2020 |
| C | football | grenoble | nancy | 24/11/2020 |
| D | football | sv turkgucu ataspor | 1 fc saarbrucken | 24/11/2020 |
| E | handball | dinamo bucuresti | nimes | 24/11/2020 |
| F | handball | bm benidorm | bm logrono la rioja | 25/11/2020 |
| G | handball | cegledi kkse | ftc budapest | 25/11/2020 |
| H | handball | lemvig | kif kobenhavn | 25/11/2020 |
| I | ice hockey | turku ps | kookoo kouvola | 25/11/2020 |
| J | football | stevenage borough | hull city | 29/11/2020 |
| K | football | west brom | sheffield united | 28/11/2020 |
+----------------+------------+---------------------+---------------------+------------+
Outputs:
find(A) returns: 84
find(B) returns: 209
find(C) returns: 184
find(D) returns: 7
find(E) returns: 414
find(F) returns: 846
find(G) returns: 874
find(H) returns: 418
find(I) returns: 740
find(J) returns: 385
find(K) returns: (something like "not found" => I would then insert the new row)
Thanks!
Basically I filter down the original table by the given date and sport. then use fuzzywuzzy to find the best match between the home and visitors between the rows remaining:
Setup:
import pandas as pd
cols = ['match_id','sport','home_team','visitor_team','date']
df1 = pd.DataFrame([
['84','football','confianca','cuiaba esporte clube','24/11/2020'],
['209','football','cs alagoana','operario pr','24/11/2020'],
['184','football','grenoble foot 38','as nancy lorraine','24/11/2020'],
['7','football','sv turkgucu-ataspor munchen','saarbrucken','24/11/2020'],
['414','handball','dinamo bucareste','usam nimes','24/11/2020'],
['846','handball','benidorm','naturhouse la rioja','25/11/2020'],
['874','handball','cegledi','ferencvarosi tc','25/11/2020'],
['418','handball','lemvig-thyboron','kif kolding','25/11/2020'],
['740','ice hockey','tps','kookoo','25/11/2020'],
['385','football','stevenage','hull','29/11/2020']], columns=cols)
cols = ['row','sport','home_team','visitor_team','date']
df2 = pd.DataFrame([
['A','football','confianca-se','cuiaba mt','24/11/2020'],
['B','football','csa','operario','24/11/2020'],
['C','football','grenoble','nancy','24/11/2020'],
['D','football','sv turkgucu ataspor','1 fc saarbrucken','24/11/2020'],
['E','handball','dinamo bucuresti','nimes','24/11/2020'],
['F','handball','bm benidorm','bm logrono la rioja','25/11/2020'],
['G','handball','cegledi kkse','ftc budapest','25/11/2020'],
['H','handball','lemvig','kif kobenhavn','25/11/2020'],
['I','ice hockey','turku ps','kookoo kouvola','25/11/2020'],
['J','football','stevenage borough','hull city','29/11/2020'],
['K','football','west brom','sheffield united','28/11/2020']], columns=cols)
Code:
import pandas as pd
from fuzzywuzzy import fuzz
import string
def calculate_ratio(row):
return fuzz.token_set_ratio(row['col1'],row['col2'] )
def find(df1, df2, row_search):
alpha = df2.query('row == "{row_search}"'.format(row_search=row_search))
sport = alpha.iloc[0]['sport']
date = alpha.iloc[0]['date']
home_team = alpha.iloc[0]['home_team']
visitor_team = alpha.iloc[0]['visitor_team']
beta = df1.query('sport == "{sport}" & date == "{date}"'.format(sport=sport,date=date))
if len(beta) == 0:
return 'Not found.'
else:
temp = pd.DataFrame({'match_id':list(beta['match_id']),'col1':list(beta['home_team'] + ' ' + beta['visitor_team']), 'col2':[home_team + ' ' + visitor_team]*len(beta)})
temp['score'] = temp.apply(calculate_ratio, axis=1)
temp = temp.sort_values('score', ascending=False)
outcome = temp.head(1).iloc[0]['match_id']
return outcome
for row_alpha in string.ascii_uppercase[0:11]:
outcome = find(df1, df2, row_alpha)
print ('{row_alpha} --> {outcome}'.format(row_alpha=row_alpha, outcome=outcome))
Output:
A --> 84
B --> 209
C --> 184
D --> 7
E --> 414
F --> 846
G --> 874
H --> 418
I --> 740
J --> 385
K --> Not found.