Search code examples
pythonsqlitedataframefuzzy-searchfuzzywuzzy

Pandas dataframe or SQLite fuzzy search


I'm scraping multiple sports betting websites in order to compare the odds for each match across the websites.

My question is how to identify match_id from a match that already exists in the DB but has team names written in a different way.
Please feel free to add any approaches even if they don't use dataframes or SQLite.

The columns for matches table are:
match_id: int, sport: string, home_team: string, away_team: string, date: string (dd/mm/YYY)

So for each new match I want to verify if it already exists in the DB.
New match = (sport_to_check, home_team_to_check, away_team_to_check, date_to_check)
My pseudo-code is like:

    SELECT match_id FROM matches
    WHERE sport = (sport_to_check)
    AND date = (date_to_check)
    AND (fuzz(home_team, home_team_to_check) > 80 OR fuzz(away_team, away_team_to_check) > 80) //the fuzzy scores evaluation  

If no match is found the new row would be inserted.

I believe there's no way to mix python with SQL like that so that's why I refer to it as "pseudo-code". I can also pull the matches table into a Pandas dataframe and do the evaluation with it, if that works (how?...).
At any given time it isn't expected for matches table to have above a couple of thousand records.

Let me give you some examples of expected outputs. Where the solution is represented by "find(row)"
Having matches table in DB as:

    +----------+------------+-----------------------------+----------------------+------------+
    | match_id | sport      | home_team                   | visitor_team         | date       |
    +----------+------------+-----------------------------+----------------------+------------+
    | 84       | football   | confianca                   | cuiaba esporte clube | 24/11/2020 |
    | 209      | football   | cs alagoana                 | operario pr          | 24/11/2020 |
    | 184      | football   | grenoble foot 38            | as nancy lorraine    | 24/11/2020 |
    | 7        | football   | sv turkgucu-ataspor munchen | saarbrucken          | 24/11/2020 |
    | 414      | handball   | dinamo bucareste            | usam nimes           | 24/11/2020 |
    | 846      | handball   | benidorm                    | naturhouse la rioja  | 25/11/2020 |
    | 874      | handball   | cegledi                     | ferencvarosi tc      | 25/11/2020 |
    | 418      | handball   | lemvig-thyboron             | kif kolding          | 25/11/2020 |
    | 740      | ice hockey | tps                         | kookoo               | 25/11/2020 |
    | 385      | football   | stevenage                   | hull                 | 29/11/2020 |
    +----------+------------+-----------------------------+----------------------+------------+

And new matches to evaluate:

    +----------------+------------+---------------------+---------------------+------------+
    | row (for demo) | sport      | home_team           | visitor_team        | date       |
    +----------------+------------+---------------------+---------------------+------------+
    | A              | football   | confianca-se        | cuiaba mt           | 24/11/2020 |
    | B              | football   | csa                 | operario            | 24/11/2020 |
    | C              | football   | grenoble            | nancy               | 24/11/2020 |
    | D              | football   | sv turkgucu ataspor | 1 fc saarbrucken    | 24/11/2020 |
    | E              | handball   | dinamo bucuresti    | nimes               | 24/11/2020 |
    | F              | handball   | bm benidorm         | bm logrono la rioja | 25/11/2020 |
    | G              | handball   | cegledi kkse        | ftc budapest        | 25/11/2020 |
    | H              | handball   | lemvig              | kif kobenhavn       | 25/11/2020 |
    | I              | ice hockey | turku ps            | kookoo kouvola      | 25/11/2020 |
    | J              | football   | stevenage borough   | hull city           | 29/11/2020 |
    | K              | football   | west brom           | sheffield united    | 28/11/2020 |
    +----------------+------------+---------------------+---------------------+------------+

Outputs:

find(A) returns: 84  
find(B) returns: 209  
find(C) returns: 184  
find(D) returns: 7  
find(E) returns: 414  
find(F) returns: 846  
find(G) returns: 874  
find(H) returns: 418  
find(I) returns: 740  
find(J) returns: 385  
find(K) returns: (something like "not found" => I would then insert the new row)  

Thanks!


Solution

  • Basically I filter down the original table by the given date and sport. then use fuzzywuzzy to find the best match between the home and visitors between the rows remaining:

    Setup:

    import pandas as pd
    
    cols = ['match_id','sport','home_team','visitor_team','date']
    
    df1 = pd.DataFrame([
    ['84','football','confianca','cuiaba esporte clube','24/11/2020'],
    ['209','football','cs alagoana','operario pr','24/11/2020'],
    ['184','football','grenoble foot 38','as nancy lorraine','24/11/2020'],
    ['7','football','sv turkgucu-ataspor munchen','saarbrucken','24/11/2020'],
    ['414','handball','dinamo bucareste','usam nimes','24/11/2020'],
    ['846','handball','benidorm','naturhouse la rioja','25/11/2020'],
    ['874','handball','cegledi','ferencvarosi tc','25/11/2020'],
    ['418','handball','lemvig-thyboron','kif kolding','25/11/2020'],
    ['740','ice hockey','tps','kookoo','25/11/2020'],
    ['385','football','stevenage','hull','29/11/2020']], columns=cols)
    
    
    cols = ['row','sport','home_team','visitor_team','date']
    
    df2 = pd.DataFrame([
    ['A','football','confianca-se','cuiaba mt','24/11/2020'],
    ['B','football','csa','operario','24/11/2020'],
    ['C','football','grenoble','nancy','24/11/2020'],
    ['D','football','sv turkgucu ataspor','1 fc saarbrucken','24/11/2020'],
    ['E','handball','dinamo bucuresti','nimes','24/11/2020'],
    ['F','handball','bm benidorm','bm logrono la rioja','25/11/2020'],
    ['G','handball','cegledi kkse','ftc budapest','25/11/2020'],
    ['H','handball','lemvig','kif kobenhavn','25/11/2020'],
    ['I','ice hockey','turku ps','kookoo kouvola','25/11/2020'],
    ['J','football','stevenage borough','hull city','29/11/2020'],
    ['K','football','west brom','sheffield united','28/11/2020']], columns=cols)
    

    Code:

    import pandas as pd
    from fuzzywuzzy import fuzz
    import string
    
    def calculate_ratio(row):
        return fuzz.token_set_ratio(row['col1'],row['col2'] )
    
    def find(df1, df2, row_search):
        alpha = df2.query('row == "{row_search}"'.format(row_search=row_search))
        sport = alpha.iloc[0]['sport']
        date = alpha.iloc[0]['date']
        home_team = alpha.iloc[0]['home_team']
        visitor_team = alpha.iloc[0]['visitor_team']
        
        beta = df1.query('sport == "{sport}" & date == "{date}"'.format(sport=sport,date=date))
        
        if len(beta) == 0:
            return 'Not found.'
        else:
            temp = pd.DataFrame({'match_id':list(beta['match_id']),'col1':list(beta['home_team'] + ' ' + beta['visitor_team']), 'col2':[home_team + ' ' + visitor_team]*len(beta)})
            temp['score'] = temp.apply(calculate_ratio, axis=1)
            temp = temp.sort_values('score', ascending=False)
            outcome = temp.head(1).iloc[0]['match_id']
            return outcome
    
    
    for row_alpha in string.ascii_uppercase[0:11]:
        outcome = find(df1, df2, row_alpha)
        print ('{row_alpha} --> {outcome}'.format(row_alpha=row_alpha, outcome=outcome))
    

    Output:

    A --> 84
    B --> 209
    C --> 184
    D --> 7
    E --> 414
    F --> 846
    G --> 874
    H --> 418
    I --> 740
    J --> 385
    K --> Not found.