So I am trying to create a table by scraping hundreds of similar pages at a time and then saving them into the same Excel table, with something like this:
#let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(len(urls)):
        #here, return_html_soup is the function returning the HTML soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
    #I chose to_excel instead of to_csv here because there are certain letters on the
    #original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
I now hit HTTP Error 429: Too Many Requests, without any Retry-After header.
Is there a way to get around this? I know this error happens because I've asked to scrape too many pages in too short an interval. Is there a way to limit the rate at which my code opens links?
The official Python documentation is the best place to start: https://docs.python.org/3/library/time.html#time.sleep
Here is an example using a 5-second delay, but you can adjust it according to your needs and the restrictions you are working with.
import time

#let urls be a list of hundreds of different URLs
def save_table(urls):
    <define columns and parameters of the dataframe to be saved, df>
    writer = pd.ExcelWriter(<address>, engine='xlsxwriter')
    for i in range(len(urls)):
        #here, return_html_soup is the function returning the HTML soup of any individual URL
        soup = return_html_soup(urls[i])
        temp_table = some_function(soup)
        df = df.append(temp_table, ignore_index=True)
        #new code: wait for some time between requests
        time.sleep(5)
    #I chose to_excel instead of to_csv here because there are certain letters on the
    #original website that don't show up in a CSV
    df.to_excel(writer, sheet_name=<some name>)
    writer.save()
    writer.close()
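If a fixed delay still occasionally trips the limit, you can also back off and retry whenever the server answers 429. Here is a minimal sketch, assuming your return_html_soup fetches pages with requests; fetch_with_backoff is a hypothetical helper, not part of your existing code:

import time
import requests
from bs4 import BeautifulSoup

def fetch_with_backoff(url, max_retries=5, base_delay=5):
    """Fetch a URL, sleeping and retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return BeautifulSoup(response.text, 'html.parser')
        #honour Retry-After if the server ever sends it (assuming it is given in seconds),
        #otherwise back off exponentially: 5s, 10s, 20s, ...
        retry_after = response.headers.get('Retry-After')
        delay = int(retry_after) if retry_after else base_delay * 2 ** attempt
        time.sleep(delay)
    raise RuntimeError(f"Still getting 429 for {url} after {max_retries} retries")

You could then call soup = fetch_with_backoff(urls[i]) inside the loop in place of return_html_soup, keeping the rest of your function unchanged.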