
How do I limit the rate of a scraper?


So I am trying to build a table by scraping hundreds of similar pages in one run and then saving them all into the same Excel sheet, with something like this:

import pandas as pd

# let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for url in urls:
        # return_html_soup is the function that returns the HTML soup of an individual URL
        soup = return_html_soup(url)
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)

    # I chose to_excel instead of to_csv here because certain characters on the
    # original website don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.close()  # close() also saves the file, so a separate save() call is redundant

I now hit HTTP Error 429: Too Many Requests, and the response comes without any Retry-After header.

Is there a way around this? I know this error happens because I've made too many requests in too short an interval. Is there a way to limit the rate at which my code opens links?


Solution

  • The official Python documentation is the best place to start: https://docs.python.org/3/library/time.html#time.sleep

    Here is an example using a 5-second delay, but you can adjust it according to what you need and the restrictions you face.

    import time

    import pandas as pd


    # let urls be a list of hundreds of different URLs
    def save_table(urls):
        <define columns and parameters of the dataframe to be saved, df>
        writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
        for url in urls:
            # return_html_soup is the function that returns the HTML soup of an individual URL
            soup = return_html_soup(url)
            temp_table = some_function(soup)
            df = df.append(temp_table, ignore_index=True)

            # new code: wait 5 seconds before moving on to the next URL
            time.sleep(5)

        # to_excel instead of to_csv because certain characters on the
        # original website don't show up in a CSV
        df.to_excel(writer, sheet_name=<some name>)
        writer.close()  # close() also saves the file, so a separate save() call is redundant
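
  • A fixed sleep is often enough, but if the server still returns 429 you can retry the failing request with an increasing delay. Below is a minimal sketch of that idea. The original question never shows the body of return_html_soup, so fetching with requests, parsing with BeautifulSoup, and the max_retries/base_delay parameters are all assumptions for illustration.

    import time

    import requests
    from bs4 import BeautifulSoup


    def return_html_soup(url, max_retries=5, base_delay=5):
        # Sketch only: the question does not show this function's body, so
        # using requests and BeautifulSoup here is an assumption.
        for attempt in range(max_retries):
            response = requests.get(url)
            if response.status_code != 429:
                response.raise_for_status()
                return BeautifulSoup(response.text, "html.parser")
            # Honor Retry-After when the server sends one (seconds form only);
            # otherwise back off exponentially: 5s, 10s, 20s, ...
            retry_after = response.headers.get("Retry-After")
            wait = int(retry_after) if retry_after else base_delay * (2 ** attempt)
            time.sleep(wait)
        raise RuntimeError(f"Still rate-limited after {max_retries} retries: {url}")

    With this in place, save_table can keep its loop (and its fixed sleep) unchanged; the backoff only kicks in when a request is actually rejected.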