How to speed up this Python script with multiprocessing


I have a script that gets data from a dataframe, uses that data to make a request to a website, finds the exact href with the fuzzywuzzy module, and then runs a function to scrape odds. I would like to speed this script up with the multiprocessing module. Is that possible?


                           Date       HomeTeam         AwayTeam
0  Monday 6 December 2021 20:00        Everton          Arsenal
1  Monday 6 December 2021 17:30         Empoli          Udinese
2  Monday 6 December 2021 19:45       Cagliari           Torino
3  Monday 6 December 2021 20:00         Getafe  Athletic Bilbao
4  Monday 6 December 2021 15:00  Real Zaragoza            Eibar
5  Monday 6 December 2021 17:15      Cartagena         Tenerife
6  Monday 6 December 2021 20:00         Girona          Leganes
7  Monday 6 December 2021 19:45          Niort         Toulouse
8  Monday 6 December 2021 19:00      Jong Ajax         FC Emmen
9  Monday 6 December 2021 19:00        Jong AZ        Excelsior

Script

  df = pd.read_excel(path)

  dates = df.Date
  hometeams = df.HomeTeam
  awayteams = df.AwayTeam

  matches_odds = list()

  for i,(a,b,c) in enumerate(zip(dates, hometeams, awayteams)):
      try:
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
      except requests.exceptions.ConnectionError:
        sleep(10)
        r = requests.get(f'https://www.betexplorer.com/results/soccer/?year={a.split(" ")[3]}&month={monthToNum(a.split(" ")[2])}&day={a.split(" ")[1]}')
      
      soup = BeautifulSoup(r.text, 'html.parser')
      f = soup.find_all('td', class_="table-main__tt")

      for tag in f: 
          match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
          hour = a.split(" ")[4]
          if hour.split(':')[0] == '23':
              act_hour = '00' + ':' + hour.split(':')[1]
          else:
              act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
          if match > 70 and act_hour == tag.find('span').text:
              href_id = tag.find('a')['href']

              table = get_odds(href_id)
              matches_odds.append(table)
          
      print(i, ' of ', len(dates))

PS: The monthToNum function just replaces the month name with its number.
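
For reference, a helper like monthToNum could look roughly like the sketch below. This is an assumption, not the asker's actual code: it presumes English month names written out in full, as in the Date column above.

```python
# Hypothetical implementation of monthToNum; the original was not posted.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

def monthToNum(name):
    """Return the month number (1-12) for a full English month name."""
    # tuple.index raises ValueError for an unknown name
    return MONTHS.index(name) + 1

if __name__ == "__main__":
    date = "Monday 6 December 2021 20:00"
    print(monthToNum(date.split(" ")[2]))
```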


Solution

  • First, turn the body of your loop into a function that takes i, a, b and c as input. Then create a multiprocessing.Pool and submit this function to the pool with the proper arguments (i, a, b, c).

    import multiprocessing
    from time import sleep
    
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup
    from fuzzywuzzy import fuzz
    
    # path, monthToNum and get_odds are assumed to be defined as in the question
    df = pd.read_excel(path)
    
    dates = df.Date
    hometeams = df.HomeTeam
    awayteams = df.AwayTeam
    
    def fetch(data):
        i, (a, b, c) = data
        parts = a.split(" ")  # e.g. "Monday 6 December 2021 20:00"
        url = f'https://www.betexplorer.com/results/soccer/?year={parts[3]}&month={monthToNum(parts[2])}&day={parts[1]}'
        try:
            r = requests.get(url)
        except requests.exceptions.ConnectionError:
            sleep(10)  # wait, then retry once
            r = requests.get(url)
    
        soup = BeautifulSoup(r.text, 'html.parser')
        f = soup.find_all('td', class_="table-main__tt")
    
        # add one hour to the kick-off time before comparing it with the page
        hour = parts[4]
        if hour.split(':')[0] == '23':
            act_hour = '00' + ':' + hour.split(':')[1]
        else:
            act_hour = str(int(hour.split(':')[0]) + 1) + ':' + hour.split(':')[1]
    
        # each worker process has its own memory, so collect the tables
        # locally and return them instead of appending to a global list
        tables = []
        for tag in f:
            match = fuzz.ratio(f'{b} - {c}', tag.find('a').text)
            if match > 70 and act_hour == tag.find('span').text:
                href_id = tag.find('a')['href']
                tables.append(get_odds(href_id))
    
        print(i, ' of ', len(dates))
        return tables
    
    if __name__ == '__main__':
        num_processes = 20
        with multiprocessing.Pool(num_processes) as pool:
            results = pool.map(fetch, enumerate(zip(dates, hometeams, awayteams)))
        # flatten the per-row lists into one list, in the original row order
        matches_odds = [table for tables in results for table in tables]
    

    Besides, multiprocessing is not the only way to improve the speed. Asynchronous programming can be used as well and is probably a better fit for this scenario, because the script spends most of its time waiting on network requests. Multiprocessing does the job too, though.

    If you read the Python multiprocessing documentation carefully, this will become obvious.
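
To illustrate the asynchronous option mentioned above, here is a minimal sketch using asyncio from the standard library (asyncio.to_thread needs Python 3.9 or later). The names fetch_all and fake_fetch are hypothetical, and fake_fetch only simulates the request with a short sleep. In the real script it would be replaced by a function that requests the page for one row and returns its odds tables. Note that this sketch still runs the blocking call in worker threads; a fully asynchronous version would use an async HTTP client instead.

```python
import asyncio
import time

def fake_fetch(row):
    """Hypothetical stand-in for the blocking request + parse step."""
    date, home, away = row
    time.sleep(0.1)  # simulates waiting on the network
    return f"{home} - {away}"

async def fetch_all(rows, worker, limit=5):
    """Run worker(row) for every row, with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run_one(row):
        async with sem:
            # run the blocking function in a thread so the event loop stays free
            return await asyncio.to_thread(worker, row)

    # gather returns the results in the same order as the input rows
    return await asyncio.gather(*(run_one(r) for r in rows))

if __name__ == "__main__":
    rows = [
        ("Monday 6 December 2021 20:00", "Everton", "Arsenal"),
        ("Monday 6 December 2021 17:30", "Empoli", "Udinese"),
        ("Monday 6 December 2021 19:45", "Cagliari", "Torino"),
    ]
    results = asyncio.run(fetch_all(rows, fake_fetch))
    print(results)
```

The semaphore keeps the number of simultaneous requests bounded, which is gentler on the website than firing every request at once.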