Tags: web-crawler, stormcrawler

Speed up the crawling process


Working on ES 6.5.x and StormCrawler 1.10. How can I speed up the crawler's fetch rate? When I check the metrics, they show an average of 0.4 pages per second. Is there anything I need to change in the crawler config below?

Crawler-Conf:

config: 
  topology.workers: 2
  topology.message.timeout.secs: 300
  topology.max.spout.pending: 100
  topology.debug: false
  fetcher.server.delay: .25
  fetcher.threads.number: 200
  fetcher.threads.per.queue: 5

  worker.heap.memory.mb: 2048

  topology.kryo.register:
    - com.digitalpebble.stormcrawler.Metadata

  http.content.limit: -1
  fetchInterval.default: 1440
  fetchInterval.fetch.error: 120
  fetchInterval.error: -1
  topology.metrics.consumer.register:
     - class: "org.apache.storm.metric.LoggingMetricsConsumer"
       parallelism.hint: 1

Solution

  • If you are crawling a single site, then you don't need 2 workers or more than one ES shard and spout: all the URLs would be directed to a single shard anyway.

    You are using 5 threads per queue but retrieving only 2 URLs per bucket from ES (es.status.max.urls.per.bucket: 2) and forcing 2 seconds between calls to ES (spout.min.delay.queries: 2000), so on average the spout can't produce more than 1 URL per second. The refresh_interval in ES_IndexInit.sh also affects how quickly changes become visible in the index, and therefore how likely you are to get fresh URLs from each request.

    Simply increase es.status.max.urls.per.bucket to a larger value, e.g. 10, and lower spout.min.delay.queries to match the refresh_interval in ES_IndexInit.sh, e.g. 1 second. This will get you a lot more URLs.
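
    Putting the points above together, the relevant settings could look like the fragment below. This is an illustrative sketch, not a drop-in config: the values are examples only, and refresh_interval itself is set in ES_IndexInit.sh rather than in the crawler config.

    ```yaml
    # Illustrative single-site tuning (example values, not prescriptions)
    topology.workers: 1                  # one site maps to one shard, so one worker suffices

    # The old spout ceiling: 2 URLs per query / 2 s between queries = ~1 URL/sec
    es.status.max.urls.per.bucket: 10    # was 2: hand the fetcher threads more URLs per query
    spout.min.delay.queries: 1000        # was 2000: align with a 1 s refresh_interval in ES_IndexInit.sh
    ```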