web-crawler apache-storm stormcrawler

StormCrawler's archetype topology does not fetch outlinks


As I understand it, the basic example should be able to crawl and fetch pages.

I followed the example at http://stormcrawler.net/getting-started/, but the crawler seems to fetch only a few pages and then does nothing more.

I wanted to crawl http://books.toscrape.com/ and ran the crawl, but the logs show that only the first page was fetched; some others were discovered but never fetched:

8010 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsing : starting http://books.toscrape.com/
8214 [Thread-34-parse-executor[5 5]] INFO  c.d.s.b.JSoupParserBolt - Parsed http://books.toscrape.com/ in 182 msec
content 1435 chars
url     http://books.toscrape.com/
domain  toscrape.com
description
title   All products | Books to Scrape - Sandbox
http://books.toscrape.com/catalogue/category/books/new-adult_20/index.html      DISCOVERED      Thu Apr 05 13:46:01 CEST 2018
        url.path: http://books.toscrape.com/
        depth: 1

http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html   DISCOVERED      Thu Apr 05 13:46:01 CEST 2018
        url.path: http://books.toscrape.com/
        depth: 1

http://books.toscrape.com/catalogue/category/books/thriller_37/index.html       DISCOVERED      Thu Apr 05 13:46:01 CEST 2018
        url.path: http://books.toscrape.com/
        depth: 1

http://books.toscrape.com/catalogue/category/books/academic_40/index.html       DISCOVERED      Thu Apr 05 13:46:01 CEST 2018
        url.path: http://books.toscrape.com/
        depth: 1

http://books.toscrape.com/catalogue/category/books/classics_6/index.html        DISCOVERED      Thu Apr 05 13:46:01 CEST 2018
        url.path: http://books.toscrape.com/
        depth: 1

http://books.toscrape.com/catalogue/category/books/paranormal_24/index.html     DISCOVERED      Thu Apr 05 13:46:01 CEST 2018
        url.path: http://books.toscrape.com/
        depth: 1



....




17131 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      6:partitioner URLPartitioner           {}
17164 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      8:spout       queue_size               0
17403 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      5:parse       JSoupParserBolt          {tuple_success=1, outlink_kept=73}
17693 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      3:fetcher     num_queues               0
17693 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      3:fetcher     fetcher_average_perdoc   {time_in_queues=265.0, bytes_fetched=51294.0, fetch_time=52.0}
17693 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      3:fetcher     fetcher_counter          {robots.fetched=1, bytes_fetched=51294, fetched=1}
17693 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      3:fetcher     activethreads            0
17693 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      3:fetcher     fetcher_average_persec   {bytes_fetched_perSec=5295.137813564571, fetched_perSec=0.10323113451016827}
17693 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928770        172.18.25.22:1024      3:fetcher     in_queues                0
27127 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      6:partitioner URLPartitioner           {}
27168 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      8:spout       queue_size               0
27405 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      5:parse       JSoupParserBolt          {tuple_success=0, outlink_kept=0}
27695 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      3:fetcher     num_queues               0
27695 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      3:fetcher     fetcher_average_perdoc   {}
27695 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      3:fetcher     fetcher_counter          {robots.fetched=0, bytes_fetched=0, fetched=0}
27695 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      3:fetcher     activethreads            0
27696 [Thread-39] INFO  o.a.s.m.LoggingMetricsConsumer - 1522928780        172.18.25.22:1024      3:fetcher     fetcher_average_persec   {bytes_fetched_perSec=0.0, fetched_perSec=0.0}

No configuration files were altered, including crawler-conf.yaml. The parser.emitOutlinks flag should also be true, since that is the default in crawler-default.yaml.

In another project I followed a YouTube tutorial on Elasticsearch. There I had a similar problem: no pages at all were fetched or indexed.

What mistake could be causing the crawler not to fetch any pages?


Solution

  • The topology generated by the archetype is merely an example and uses StdOutStatusUpdater, which simply dumps discovered URLs to the console; nothing feeds them back to the spout, so they are never fetched. If you are running in local mode or with a single worker, you could use MemoryStatusUpdater instead: it adds discovered URLs to the MemorySpout, and these are then processed in turn.

    Note that this does not persist the information about the URLs when you terminate the topology or if the topology crashes. Again, this is just for debugging and as an initial step with StormCrawler.

    If you want the URLs to be persisted, you could use any of the persistence backends (Solr, Elasticsearch, SQL). Feel free to describe your issue with ES as a separate question.
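
To illustrate why the discovered URLs are never fetched, here is a minimal toy model in plain Java. It does not use the StormCrawler or Storm APIs: the class StatusUpdaterSketch, the interface StatusSink, the method crawl and the example.test URLs are all hypothetical names made up for this sketch. It only mimics the behaviour described above: a sink that prints discovered URLs versus one that puts them back into the in-memory queue the spout reads from.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy model only: none of these types belong to StormCrawler.
class StatusUpdaterSketch {

    /** Hypothetical stand-in for a status updater: receives discovered outlinks. */
    interface StatusSink {
        void discovered(String url);
    }

    /**
     * Crawls a fake link graph starting from the seed and returns the URLs
     * in the order they were "fetched".
     * feedBack=false: outlinks are only printed (console-style updater).
     * feedBack=true:  outlinks are queued for fetching (in-memory updater).
     */
    static List<String> crawl(Map<String, List<String>> links, String seed, boolean feedBack) {
        Queue<String> spoutQueue = new ArrayDeque<>(); // what the spout would emit
        Set<String> known = new HashSet<>();           // avoids queueing a URL twice
        List<String> fetched = new ArrayList<>();

        StatusSink sink = feedBack
                ? url -> { if (known.add(url)) spoutQueue.add(url); }
                : url -> System.out.println(url + "\tDISCOVERED");

        known.add(seed);
        spoutQueue.add(seed);
        while (!spoutQueue.isEmpty()) {
            String url = spoutQueue.poll();
            fetched.add(url); // "fetch" and "parse" the page
            for (String outlink : links.getOrDefault(url, List.of())) {
                sink.discovered(outlink); // hand each outlink to the status sink
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // Made-up link graph; the URLs are placeholders.
        Map<String, List<String>> links = Map.of(
                "http://example.test/", List.of("http://example.test/a", "http://example.test/b"),
                "http://example.test/a", List.of("http://example.test/b", "http://example.test/c"));

        System.out.println("print only: " + crawl(links, "http://example.test/", false));
        System.out.println("feed back:  " + crawl(links, "http://example.test/", true));
    }
}
```

In this model the print-only sink fetches just the seed, which matches the log in the question (fetched=1, outlink_kept=73, queue_size 0), while the feed-back sink goes on to fetch every reachable page. As in the real in-memory setup, the queue lives only in the running process, so nothing survives a restart.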