apache-storm stormcrawler

Can StormCrawler have a different status index for each topology?


I'm crawling about 20 domains and will eventually scale to 300. Each domain has its own parser config, and each is submitted as a separate topology.

With a single status index, all the topologies seem to pick up URLs at random instead of only those for their own domain.

Would a separate status index for each topology solve this? Are there any other approaches?

Also, I cannot use a single topology for all domains because the crawl rates and crawl times differ, and each domain is very different from the others.


Solution

  • You can have one index per crawl. However, if you want to run one topology per domain, it would be much simpler to add an arbitrary metadata key to the seeds of each crawl and make sure it gets transferred to the outlinks. You can then give each topology its own filter query so that its spout gets URLs for that crawl only. The metadata key could be something like crawlID, for instance.

    Also, I cannot use a single topology for all domains because the crawl rates and crawl times differ, and each domain is very different from the others.

    There is probably a way around this. Having a single topology would make things a lot simpler.
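
To make the crawlID idea concrete, here is a toy Python model of it. It is only a sketch of the logic, not StormCrawler code: the `StatusEntry` class, the helper functions and the example URLs are all made up for illustration, and the shared status index is just a list. In StormCrawler itself, I believe the relevant settings are `metadata.transfer` (keys copied from a page to its outlinks) and, for the Elasticsearch status index, `es.status.filterQuery`, but check the names against the version you run.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for one document in the shared status index.
@dataclass
class StatusEntry:
    url: str
    metadata: dict = field(default_factory=dict)

# Metadata keys copied from a page to the links found on it
# (the role the answer gives to "transferring" the key to outlinks).
TRANSFERRED_KEYS = ("crawlID",)

def make_seed(url, crawl_id):
    """Tag a seed URL with the ID of the crawl it belongs to."""
    return StatusEntry(url, {"crawlID": crawl_id})

def outlink_from(parent, url):
    """Create an outlink entry that inherits the transferred keys."""
    inherited = {k: v for k, v in parent.metadata.items()
                 if k in TRANSFERRED_KEYS}
    return StatusEntry(url, inherited)

def urls_for_crawl(status_index, crawl_id):
    """What a spout with a per-topology filter on crawlID would see."""
    return [e.url for e in status_index
            if e.metadata.get("crawlID") == crawl_id]

# Two crawls sharing one status index.
seed_a = make_seed("https://site-a.example/", "site-a")
seed_b = make_seed("https://site-b.example/", "site-b")
status_index = [
    seed_a,
    seed_b,
    outlink_from(seed_a, "https://site-a.example/news"),
    outlink_from(seed_b, "https://site-b.example/shop"),
]

# The topology for site-a only ever receives site-a URLs.
print(urls_for_crawl(status_index, "site-a"))
```

The point of the sketch is that the index stays shared and the separation comes entirely from the tag plus the filter, so adding a domain means adding a new crawlID value rather than a new index.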