Tags: web-crawler, nutch, stormcrawler

Prioritizing recursive crawl in Storm Crawler


When crawling the world wide web, I want to give my crawler an initial seed list of URLs and expect it to automatically 'discover' new URLs from the internet during its crawl.

I see such an option in Apache Nutch (see the topN parameter of the generate command). Is there any such option in StormCrawler as well?
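For reference, the Nutch 1.x generate step mentioned above looks roughly like this; the paths and the limit are illustrative:

```
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
```

This selects at most the top 1000 URLs (by score) for the next fetch segment.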


Solution

  • StormCrawler can handle recursive crawls, and the way URLs are prioritized depends on the backend used for storing the URLs.

    For instance, the Elasticsearch module can be used for that; see its README for a short tutorial and the sample config file, where by default the spouts sort URLs by their nextFetchDate (the **.sort.field** settings). A sketch of the relevant settings is shown after this answer.

    In Nutch, the -topN argument only specifies the maximum number of URLs to put into the next segment (based on the scores provided by whichever scoring plugin is used). With StormCrawler we don't really need an equivalent, as URLs are not processed in batches; the crawl runs continuously.
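    The excerpt below sketches the spout settings referred to above. The key names are taken from a 1.x es-conf.yaml and may differ between StormCrawler versions, so treat it as an illustrative sketch and check the sample config file shipped with the Elasticsearch module.

    ```yaml
    # Spout settings from the Elasticsearch module's sample config.
    # URLs are grouped into buckets (typically one per hostname) and sorted so
    # that the URLs due soonest (smallest nextFetchDate) are emitted first.
    es.status.max.buckets: 50
    es.status.max.urls.per.bucket: 10
    es.status.bucket.field: "metadata.hostname"
    es.status.bucket.sort.field: "nextFetchDate"
    es.status.global.sort.field: "nextFetchDate"
    ```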