python, timeout, screen-scraping, sleep, difference

Python web scraping: difference between sleep and request(page, timeout=x)


When scraping multiple websites in a loop, I notice a rather large difference in speed between

sleep(10)
response = requests.get(url)

and

response = requests.get(url, timeout=10)

That is, the version with timeout is much faster.

Moreover, for both set-ups I expected a scraping duration of at least 10 seconds per page before requesting the next page, but this is not the case.

  1. Why is there such a difference in speed?
  2. Why is the scraping duration per page less than 10 seconds?

I now use multiprocessing, but I seem to remember that the same holds without multiprocessing.


Solution

  • time.sleep(10) pauses your script for 10 seconds no matter what, while timeout=10 is only the maximum time requests will wait for the server to respond. If the response arrives before the timeout is up, requests.get returns right away and the remaining time is never spent waiting. That is why a request with a timeout can take much less than 10 seconds.

    time.sleep is different: it blocks the script until the sleep is over, and only then is the request sent, which takes a few more seconds. So the sleep version takes more than 10 seconds per page every time.

    The two have very different uses. If you want each page to take at least 10 seconds, time the request yourself and, if it finishes in under 10 seconds, make the program sleep for the remainder.
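
    A minimal sketch of such a timer, assuming you want every page to take at least 10 seconds in total. The names run_at_least and polite_get are made up for this example and are not part of requests; the defaults of 10 seconds simply mirror the question.

```python
import time


def run_at_least(min_seconds, func, *args, **kwargs):
    """Call func and make sure the whole call lasts at least min_seconds."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        # Sleep only for what is left of the interval, even if func raised.
        remaining = min_seconds - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)


def polite_get(url, min_seconds=10.0, timeout=10.0):
    """Hypothetical wrapper: a GET that never returns before min_seconds."""
    # Third-party import kept local so run_at_least works without requests.
    import requests

    # timeout bounds how long we wait on the server;
    # min_seconds is the floor on the duration of the whole call.
    return run_at_least(min_seconds, requests.get, url, timeout=timeout)
```

    In the scraping loop you would then write response = polite_get(url) instead of calling requests.get directly. A fast page is padded up to 10 seconds, a slow page gets no extra delay, and a timeout still raises its exception once the interval has passed.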