Tags: web-crawler, apache-storm, stormcrawler

What is the proper way to loop discovered URLs back to fetch them?


I've started with the default topology, but I want to do a recursive crawl. So I have to modify the flux file to loop discovered URLs back to the fetcher, and I'm not sure what the best way to do this is.

Is there a good example of how to do this, perhaps one that works with Elasticsearch?

Regards, Chris


Solution

  • You need to store the information about the URLs when running a recursive crawl. Feeding them straight back to the Fetcher is not enough, as it won't take duplicates into account or give you any control over scheduling.

    There are a number of options available in the external modules: Elasticsearch is one of them, but you can also use SOLR or a SQL backend. See the Flux sketch after this answer for how the status stream is typically wired back through a status index.

    See our YouTube channel for tutorials on how to use StormCrawler with Elasticsearch.

    There is an implementation of StatusUpdater which feeds the discovered URLs back to the MemorySpout, but this is only useful for testing / debugging in local mode.
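
    As an illustration, here is a minimal Flux sketch of the recommended approach: discovered URLs are written to a status index in Elasticsearch by a StatusUpdaterBolt, and a spout reads due URLs back out of that index on later cycles, which closes the loop. This is a sketch modelled on the es-crawler.flux example shipped with the Elasticsearch module, not the exact file from the answer; class names, package prefixes (com.digitalpebble.stormcrawler in older releases, org.apache.stormcrawler in newer ones) and config file names vary by version, so check the resources bundled with the release you are running. The partitioner, sitemap and indexing bolts are omitted for brevity.

        name: "crawler"

        includes:
          # default StormCrawler settings plus the ES-specific configuration
          - resource: true
            file: "/crawler-default.yaml"
            override: false
          - resource: false
            file: "es-conf.yaml"   # index names, ES address, scheduling settings (assumed file name)
            override: true

        spouts:
          # reads URLs that are due for (re)fetching from the ES status index
          - id: "spout"
            className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.AggregationSpout"
            parallelism: 1

        bolts:
          - id: "fetcher"
            className: "com.digitalpebble.stormcrawler.bolt.FetcherBolt"
            parallelism: 1
          - id: "parse"
            className: "com.digitalpebble.stormcrawler.bolt.JSoupParserBolt"
            parallelism: 1
          # persists fetch results and discovered outlinks into the ES status index
          - id: "status"
            className: "com.digitalpebble.stormcrawler.elasticsearch.persistence.StatusUpdaterBolt"
            parallelism: 1

        streams:
          - from: "spout"
            to: "fetcher"
            grouping:
              type: SHUFFLE

          - from: "fetcher"
            to: "parse"
            grouping:
              type: LOCAL_OR_SHUFFLE

          # the "status" stream carries fetch results and discovered URLs;
          # storing them in ES is what makes the crawl recursive, since the
          # spout will emit them on a later cycle
          - from: "fetcher"
            to: "status"
            grouping:
              type: FIELDS
              args: ["url"]
              streamId: "status"

          - from: "parse"
            to: "status"
            grouping:
              type: FIELDS
              args: ["url"]
              streamId: "status"

    The memory-based variant mentioned above is wired the same way, with MemorySpout and the memory-backed StatusUpdater in place of the Elasticsearch components, but its state is lost when the topology stops, so it is only suitable for local debugging.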