When scraping multiple websites in a loop, I notice there is a rather large difference in speed between,
sleep(10)
response = requests.get(url)
and,
response = requests.get(url, timeout=10)
That is, the timeout version is much faster.
Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.
I now use multiprocessing, but as far as I remember the same holds without multiprocessing.
time.sleep stops your script from running for a given number of seconds, while timeout is the maximum time to wait for the URL to respond (in requests it bounds the wait for the server, not the total download time). If the data is retrieved before the timeout expires, the remaining time is skipped, so a request with timeout can take less than 10 seconds.
time.sleep is different: it pauses your script completely until it's done sleeping, and only then runs your request, which takes another few seconds. So the time.sleep version will take more than 10 seconds every time.
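To make the difference concrete, here is a small sketch that needs no network. It assumes a stand-in function, fake_get (hypothetical, not part of requests), that "responds" after a fixed delay, and it uses durations scaled down from 10 seconds to fractions of a second. Real requests will of course vary in how long they take.

```python
import time


def fake_get(duration, timeout):
    """Hypothetical stand-in for requests.get: responds after `duration`
    seconds, or raises once `timeout` seconds have passed without a response."""
    if duration > timeout:
        time.sleep(timeout)
        raise TimeoutError("no response within timeout")
    time.sleep(duration)
    return "ok"


def sleep_then_get(pause, duration):
    """Mimics sleep(pause) followed by a request with no timeout."""
    time.sleep(pause)
    return fake_get(duration, timeout=float("inf"))


def timed(func):
    """Return how many seconds func() took to run."""
    started = time.monotonic()
    func()
    return time.monotonic() - started


if __name__ == "__main__":
    # The pause is always paid in full, then the request time is added on top.
    print("sleep + get:", timed(lambda: sleep_then_get(0.2, 0.05)))
    # The timeout is only an upper bound; a fast response returns early.
    print("timeout    :", timed(lambda: fake_get(0.05, timeout=0.2)))
```

The first figure should come out at roughly the pause plus the response time, the second at roughly the response time alone.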
They have very different uses, but for your case you should use a timer: if the request finishes in under 10 seconds, make the program wait out the remainder.
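A minimal sketch of such a timer, assuming a single process and a 10-second minimum between the starts of consecutive requests. The names remaining_wait and fetch_with_min_interval are made up for illustration; only requests.get, its timeout argument and requests.RequestException come from the library.

```python
import time


def remaining_wait(started, min_interval, now):
    """Seconds still left of the minimum interval (0 if it has already passed)."""
    return max(0.0, min_interval - (now - started))


def fetch_with_min_interval(urls, min_interval=10.0, timeout=10.0):
    """Yield (url, response) pairs, starting at most one request
    every `min_interval` seconds. response is None if the request failed."""
    # Imported here so the timing helper above works without requests installed.
    import requests

    last_started = None
    for url in urls:
        if last_started is not None:
            # Wait out whatever is left of the interval since the previous request.
            time.sleep(remaining_wait(last_started, min_interval, time.monotonic()))
        last_started = time.monotonic()
        try:
            response = requests.get(url, timeout=timeout)
        except requests.RequestException:
            response = None  # timeout, connection error, etc.
        yield url, response
```

Looping with `for url, response in fetch_with_min_interval(urls):` then spaces the requests at least 10 seconds apart, while a slow request that already took longer than that is followed by the next one immediately. With multiprocessing, each worker would keep its own timer, so the overall request rate would be higher than one per interval.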