nutch

Is it possible to have different fetch intervals in Nutch?


Is it possible to use a different fetch interval for each URL that I have listed, or for each group of URLs?

If not, is there a command that I can use to fetch a URL whenever I want (this way I could use a cron job or a daemon)?


Solution

  • For seed URLs (those defined in the seed file), you can set the fetch interval through the metadata portion of the inject step (https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Injector.java#L69-L72). This lets you control how often each seed link is fetched. Links discovered during the crawl get their own scheduling, however. You could write something that propagates nutch.fetchInterval or nutch.fetchInterval.fixed from a seed to its outlinks, so that all links on the same host share the same fetch interval (or follow your own algorithm).

    That said, you can also write your own custom fetch schedule (similar to the ones bundled with Nutch: mimetype/default/adaptive) that implements your custom logic.
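
The seed-metadata approach can be sketched as a small script that writes seed-file lines, each made of a URL followed by a tab and a key=value pair, which is the format the Injector documentation linked above describes. This is only an illustration: the helper names (seed_line, build_seed_file), the example.com URLs, and the interval values are assumptions, not part of Nutch.

```python
# Sketch: build Nutch seed-file lines that carry a per-URL fetch interval.
# The URLs, intervals, and helper names below are made-up examples.

DAY = 24 * 60 * 60  # fetch intervals are given in seconds


def seed_line(url, interval_seconds, fixed=False):
    """Return one seed-file line: the URL, a tab, then key=value metadata."""
    # nutch.fetchInterval.fixed is the variant named in the answer above;
    # choose it when the interval should stay as injected.
    key = "nutch.fetchInterval.fixed" if fixed else "nutch.fetchInterval"
    return f"{url}\t{key}={interval_seconds}"


def build_seed_file(entries):
    """Join (url, interval_seconds, fixed) tuples into seed-file text."""
    return "\n".join(seed_line(*entry) for entry in entries) + "\n"


if __name__ == "__main__":
    seeds = [
        ("https://news.example.com/", 1 * DAY, False),     # refetch daily
        ("https://archive.example.com/", 30 * DAY, True),  # refetch monthly
    ]
    print(build_seed_file(seeds), end="")
```

Saving the output into the seed directory and then running the usual inject step (for example bin/nutch inject crawl/crawldb urls, where both paths are placeholders) should apply these intervals to the seeds only. Outlinks still follow the configured fetch schedule unless you propagate the metadata yourself.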