Tags: web-crawler, nutch, ddos

Crawl delay based on IP address vs hostname vs domain name


For example, in the case of crawling stackoverflow, it makes sense to delay based on hostname or domain name (e.g. send at most one request to stackoverflow.com every 10 minutes).

In the case of *.blogspot.com, it only makes sense to delay requests based on the domain name: there are millions of hostnames ending in .blogspot.com, so delaying per hostname would still flood the server with millions of requests.

When crawling a wide range of websites (web-scale crawls), what is the best practice for imposing delays between requests? Should I delay requests based on IP address, hostname, or domain name?
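
To make the three options concrete, here is a rough sketch of how a crawler might derive the politeness-queue key under each mode. The `queue_key` helper and the mode names are made up for this illustration; `tldextract` is an assumed third-party dependency used only because the standard library cannot distinguish a registered domain such as blogspot.com from a multi-level public suffix like co.uk.

```python
# Sketch: three ways a crawler can key its politeness queues.
import socket
from urllib.parse import urlparse

import tldextract  # assumed third-party dependency for this illustration


def queue_key(url: str, mode: str) -> str:
    """Return the key used to group requests for delay purposes.

    URLs are assumed to be absolute (with a scheme) so urlparse()
    can extract the hostname.
    """
    host = urlparse(url).hostname
    if mode == "byHost":
        # foo.blogspot.com and bar.blogspot.com get separate queues.
        return host
    if mode == "byDomain":
        # Every *.blogspot.com URL collapses into a single queue.
        ext = tldextract.extract(url)
        return f"{ext.domain}.{ext.suffix}"
    if mode == "byIP":
        # Hosts served from the same machine share one queue,
        # no matter how many hostnames point at it.
        return socket.gethostbyname(host)
    raise ValueError(f"unknown mode: {mode}")
```

With byHost keying, millions of blogspot subdomains each look like a separate "server"; with byDomain or byIP they collapse into a handful of queues, which is exactly the trade-off the question is about.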


Solution

  • It is good practice to partition by IP with Nutch. The generation step takes a bit longer because of the IP resolution, but you get a guarantee that the Fetcher will behave politely while still keeping good performance. The politeness settings from robots.txt will be enforced anyway.

    I have done multi-billion-page crawls with Nutch, and from experience grouping URLs by IP is the best option. The last thing you want is to be blacklisted by websites or, worse, have AWS (or whichever cloud provider you are running on) kick you out. Many webmasters do not even know about robots.txt and will get very defensive if they perceive your crawler as abusive, even if you intend to crawl politely. The larger the scale, the more cautious you should be. The sketch below illustrates the per-IP grouping idea outside of Nutch.
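
    For context, a minimal sketch of the per-IP politeness idea, not Nutch code: the `MIN_DELAY` value and the `polite_fetch` helper are made up for the example. In Nutch itself this is driven by configuration rather than code (the partition and fetcher queue mode properties in nutch-default.xml can be set to a by-IP mode in the 1.x line; check the defaults shipped with your version).

    ```python
    # Sketch: group requests by resolved IP and enforce a minimum delay
    # between hits on the same IP, so that many hostnames served from one
    # machine do not translate into many concurrent requests to it.
    import socket
    import time
    from collections import defaultdict
    from urllib.parse import urlparse

    MIN_DELAY = 5.0                 # seconds between hits on one IP (assumed value)
    last_hit = defaultdict(float)   # ip -> timestamp of the previous request


    def polite_fetch(url: str) -> None:
        ip = socket.gethostbyname(urlparse(url).hostname)
        wait = MIN_DELAY - (time.monotonic() - last_hit[ip])
        if wait > 0:
            time.sleep(wait)        # a real fetcher would schedule, not block
        last_hit[ip] = time.monotonic()
        # ... issue the HTTP request here, honouring robots.txt ...
    ```

    The point of keying on the IP rather than the hostname is visible in `last_hit`: thousands of vhosts on one shared-hosting box all map to the same entry, so the box sees at most one request per `MIN_DELAY` regardless of how many sites it serves.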