web-crawler, apache-storm, stormcrawler

Will my spout thread stay idle in StormCrawler after processing all the URLs in the bucket allocated to it?


1) What happens when the number of buckets in the database is greater than the number of spout threads? 2) What happens when only one bucket contains URLs but there are 10 spout threads? Will the remaining 9 threads stay idle?


Solution

  • You should set the number of spout instances to be the same as the number of buckets. If there are more buckets than spout instances, only the buckets with a number lower than or equal to the number of instances will get queried; URLs in the remaining buckets are never fetched.

    Each spout instance queries the db for a specific bucket number. If the corresponding bucket contains URLs, these are sent down the topology. If not, the spout instance stays idle for a short period and then queries the same bucket again. So with URLs in only one bucket and 10 spout instances, the other 9 instances do no useful work.

    The code of the SQL spout is pretty straightforward.
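
To make the behaviour above concrete, here is a minimal simulation in plain Java. It is only a sketch and not the actual SQL spout code: `UrlStore`, `SpoutInstance` and `run` are hypothetical names, the in-memory map stands in for the database table, and it assumes that instances are numbered from 0 and that each instance queries the bucket matching its own index.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class BucketPollingSketch {

    /** Hypothetical stand-in for the URL table: bucket number -> pending URLs. */
    static final class UrlStore {
        private final Map<Integer, Queue<String>> buckets = new HashMap<>();

        void add(int bucket, String url) {
            buckets.computeIfAbsent(bucket, b -> new ArrayDeque<>()).add(url);
        }

        /** Removes and returns up to maxUrls URLs from a single bucket. */
        List<String> fetch(int bucket, int maxUrls) {
            List<String> out = new ArrayList<>();
            Queue<String> queue = buckets.get(bucket);
            while (queue != null && !queue.isEmpty() && out.size() < maxUrls) {
                out.add(queue.poll());
            }
            return out;
        }
    }

    /** Hypothetical spout instance that only ever queries its own bucket. */
    static final class SpoutInstance {
        final int bucket;
        int emitted = 0;   // URLs sent down the topology
        int idlePolls = 0; // polls that found the bucket empty

        SpoutInstance(int bucket) {
            this.bucket = bucket;
        }

        void poll(UrlStore store) {
            List<String> urls = store.fetch(bucket, 10);
            if (urls.isEmpty()) {
                idlePolls++; // a real spout would wait a little before retrying
            } else {
                emitted += urls.size();
            }
        }
    }

    /** Creates one instance per index (assumption: instance i reads bucket i) and polls in rounds. */
    static List<SpoutInstance> run(UrlStore store, int instances, int rounds) {
        List<SpoutInstance> spouts = new ArrayList<>();
        for (int i = 0; i < instances; i++) {
            spouts.add(new SpoutInstance(i));
        }
        for (int r = 0; r < rounds; r++) {
            for (SpoutInstance spout : spouts) {
                spout.poll(store);
            }
        }
        return spouts;
    }

    public static void main(String[] args) {
        UrlStore store = new UrlStore();
        store.add(0, "https://example.com/a");
        store.add(0, "https://example.com/b");
        store.add(5, "https://example.com/c"); // no instance is bound to bucket 5

        for (SpoutInstance spout : run(store, 3, 4)) {
            System.out.println("instance " + spout.bucket
                    + " emitted=" + spout.emitted
                    + " idlePolls=" + spout.idlePolls);
        }
        System.out.println("left in bucket 5: " + store.fetch(5, 10).size());
    }
}
```

In this sketch the instance bound to bucket 0 emits its URLs and then idles, the instances bound to empty buckets idle on every poll, and the URL in bucket 5 is never picked up because no instance queries that bucket. That is why matching the number of spout instances to the number of buckets matters.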