When writing a web crawler/scraper, what algorithms and techniques are available to throttle requests and avoid DoS'ing the server/getting banned? This comes up often when reading about web scraping (for example, here), but always as something like "I should have implemented throttling, but didn't" :)
My Google-fu may be weak, as I found mostly discussions on how to throttle requests server-side, and others (like this question) are specific about some library.
The most generic, cross-language way is to sleep between requests. Something like 10 seconds sleep should mimic how fast a real human goes through web pages. To avoid robot identifying algorithms some people sleep a random amount of time: sleep(ten_seconds + rand())
.
You can make it fancier by keeping track of different sleep timeouts for each domain so that you can fetch something from another server while waiting for the sleep timeout.
The second method is to actually try to reduce the bandwidth for your request. You may need to write your own http client with this feature to do it. Or on linux you can use the network stack to do it for you - google qdisc
.
You can of course combine both methods.
Note that reducing bandwidth is not very friendly to sites with lots of small resources. That's because you're increasing the amount of time you're connected for each resource hence occupying one network socket and probably one web server thread while you're at it.
On the other hand not reducing bandwidth is not very friendly to sites with lots of large resources like mp3 files or videos. That's because you're saturating their network - switches, routers, ISP connection - by downloading as fast as they can serve.
A smart implementation will download small files at full speed, sleeping between downloads, but reduce bandwidth for large files.