I am doing a proof of concept with StormCrawler and Elasticsearch, crawling just a single domain with a few hosts, one of which has a very large number of pages. Is there a way to tell StormCrawler not to group all of the URLs for a host or domain in a single spout?
I followed the YouTube tutorials when setting it up, and have the spout parallelism set to 10, but as far as I can tell from the Storm UI it is using only 1 instance. How do I spread the URLs for a single domain, or even a single host, over all of the spouts?
Thanks! Jim
To partition the URLs per host, your config should have partition.url.mode: "byHost", which is the default value. This puts URLs belonging to different hosts into different shards, so more spout instances will be used.
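As a sketch, the relevant entry in the crawler configuration file (the filename is an assumption, typically something like crawler-conf.yaml) would look like:

```yaml
# Partition URLs by host: URLs from different hosts go to different
# Elasticsearch shards, and each spout instance reads its own shard.
partition.url.mode: "byHost"
```

Since "byHost" is the default, having no entry at all is equivalent; the alternative values group by domain or by IP instead.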
URLs from the same host are put in the same shard to enforce politeness. If you want to fetch from a host in parallel, you can simply set fetcher.threads.per.queue to whatever value you want. This is acceptable if the website is your own, but clearly impolite if it belongs to someone else. It will work fine even if you keep sharding per host.
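For example (the value 5 here is just an illustration; pick a number appropriate for a site you control):

```yaml
# Fetch up to 5 URLs from the same host concurrently.
# Raising this trades politeness for throughput, so only do it
# on hosts you own or have permission to hammer.
fetcher.threads.per.queue: 5
```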
Of course, you can disable routing altogether by setting es.status.routing to false. The URLs will then be sharded by ES regardless of hostname, and all the shards and spouts will be used. The implication, however, is that there is no strict control over politeness.
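That option would look like this in the same configuration file:

```yaml
# Let Elasticsearch distribute URLs across shards without regard to
# hostname. All spout instances get work, but per-host politeness is
# no longer strictly enforced.
es.status.routing: false
```

Combined with a reasonable fetcher.threads.per.queue, this is usually the simplest way to parallelise a single-domain crawl that you control.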