web-crawler apache-storm stormcrawler

What is the use of the bucket number in StormCrawler?


When crawling multiple websites with partitioning by "host", the partition key (also called the bucket) is generated from the host, and each spout instance is given one bucket to fetch URLs from. What happens if I crawl only one website? In that case I have only one bucket; does that mean only one spout instance will access it? And when crawling many websites, once all the URLs from one bucket have been crawled, will the spout instance move on to the next bucket or not?


Solution

  • If you crawl a single site then yes, only one spout instance will be active. If you crawl many sites, their URLs are distributed across multiple buckets, and as many spout instances as there are buckets in use will be active. When there are no more URLs to fetch for a bucket (shard), the corresponding spout instance stops sending URLs down the topology; it stays with that bucket rather than moving to another one. The other spout instances continue processing URLs until they have none left either.
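
To make the bucket idea concrete, here is a minimal sketch in plain Java. It is not StormCrawler's actual code: the class `BucketSketch`, the helpers `bucketFor` and `partition`, and the hash-modulo assignment are all assumptions chosen only to illustrate how a host-based partition key groups URLs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BucketSketch {

    // Hypothetical helper: derive a bucket number from the URL's host name.
    // Hash-modulo is an assumption for illustration, not StormCrawler's routing logic.
    static int bucketFor(String url, int numBuckets) {
        String host = URI.create(url).getHost();
        // floorMod keeps the result in [0, numBuckets) even for negative hash codes
        return Math.floorMod(host.hashCode(), numBuckets);
    }

    // Group URLs by bucket; only buckets that actually receive URLs appear in the map.
    static Map<Integer, List<String>> partition(List<String> urls, int numBuckets) {
        Map<Integer, List<String>> buckets = new TreeMap<>();
        for (String url : urls) {
            buckets.computeIfAbsent(bucketFor(url, numBuckets), k -> new ArrayList<>()).add(url);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4; // assumed to match the number of spout instances

        // One site: every URL shares the same host, so only one bucket is used.
        List<String> singleSite = List.of(
                "https://example.com/a",
                "https://example.com/b",
                "https://example.com/c");
        System.out.println("single site, buckets in use: " + partition(singleSite, numBuckets).size());

        // Several sites: URLs spread over the buckets their hosts map to.
        List<String> manySites = List.of(
                "https://example.com/a",
                "https://example.org/b",
                "https://example.net/c");
        System.out.println("many sites, bucket contents: " + partition(manySites, numBuckets));
    }
}
```

With a single host, all URLs land in one bucket, so only the spout instance reading that bucket has anything to emit. Note that in this sketch two different hosts can hash to the same bucket, so the number of buckets in use may be smaller than the number of sites.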