I'm running Elasticsearch 6.5.x with StormCrawler 1.10. How can I speed up the crawler's fetching? When I check the metrics, it shows an average of 0.4 pages per second. Is there anything I need to change in the crawler config below?
Crawler-Conf:

```yaml
config:
  topology.workers: 2
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false
  fetcher.server.delay: .25
  fetcher.threads.number: 200
  fetcher.threads.per.queue: 5
  worker.heap.memory.mb: 2048
  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata
  http.content.limit: -1
  fetchInterval.default: 1440
  fetchInterval.fetch.error: 120
  fetchInterval.error: -1
  topology.metrics.consumer.register:
    - class: "org.apache.storm.metric.LoggingMetricsConsumer"
      parallelism.hint: 1
```
If you are crawling a single site, you don't need 2 workers or more than one ES shard and spout: all the URLs will be directed to a single shard anyway.
You are using 5 fetch threads per queue but retrieving only 2 URLs per bucket from ES (es.status.max.urls.per.bucket: 2) and forcing a 2-second delay between calls to ES (spout.min.delay.queries: 2000), so on average the spout can't emit more than 1 URL per second. The refresh_interval set in ES_IndexInit.sh also affects how quickly changes become visible in the status index, and therefore how likely a query is to return fresh URLs.
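For reference, the refresh interval can also be lowered on a live index without re-running ES_IndexInit.sh. A sketch, assuming the default index name `status` and a cluster on localhost (adjust both to your setup):

```shell
# Lower the refresh interval on the existing status index so newly
# updated URLs become visible to the spout's queries sooner.
# Index name and host are assumptions -- adjust to match your setup.
curl -X PUT "localhost:9200/status/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index": {"refresh_interval": "1s"}}'
```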
Simply change es.status.max.urls.per.bucket to a larger value, e.g. 10, and drop spout.min.delay.queries to the same value as the refresh_interval in ES_IndexInit.sh, e.g. 1 second. This will get you a lot more URLs.
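The corresponding entries in the crawler config would look like this (the 1-second value is an assumption; match whatever refresh_interval ES_IndexInit.sh actually sets):

```yaml
# Pull more URLs per host/domain bucket on each query to the status index
es.status.max.urls.per.bucket: 10
# Query ES at the same cadence as the index refresh_interval
# (assumed to be 1s here); the value is in milliseconds
spout.min.delay.queries: 1000
```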