From my understanding, the basic example should be able to crawl and fetch pages.
I followed the example at http://stormcrawler.net/getting-started/, but the crawler seems to fetch only a few pages and then does nothing more.
I wanted to crawl http://books.toscrape.com/ and ran the crawl, but the logs show that only the first page was fetched; some others were discovered but never fetched:
8010 [Thread-34-parse-executor[5 5]] INFO c.d.s.b.JSoupParserBolt - Parsing : starting http://books.toscrape.com/
8214 [Thread-34-parse-executor[5 5]] INFO c.d.s.b.JSoupParserBolt - Parsed http://books.toscrape.com/ in 182 msec
content 1435 chars
url http://books.toscrape.com/
domain toscrape.com
description
title All products | Books to Scrape - Sandbox
http://books.toscrape.com/catalogue/category/books/new-adult_20/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/thriller_37/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/academic_40/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/classics_6/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
http://books.toscrape.com/catalogue/category/books/paranormal_24/index.html DISCOVERED Thu Apr 05 13:46:01 CEST 2018
url.path: http://books.toscrape.com/
depth: 1
....
17131 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 6:partitioner URLPartitioner {}
17164 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 8:spout queue_size 0
17403 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 5:parse JSoupParserBolt {tuple_success=1, outlink_kept=73}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher num_queues 0
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher fetcher_average_perdoc {time_in_queues=265.0, bytes_fetched=51294.0, fetch_time=52.0}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher fetcher_counter {robots.fetched=1, bytes_fetched=51294, fetched=1}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher activethreads 0
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher fetcher_average_persec {bytes_fetched_perSec=5295.137813564571, fetched_perSec=0.10323113451016827}
17693 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928770 172.18.25.22:1024 3:fetcher in_queues 0
27127 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 6:partitioner URLPartitioner {}
27168 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 8:spout queue_size 0
27405 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 5:parse JSoupParserBolt {tuple_success=0, outlink_kept=0}
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher num_queues 0
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher fetcher_average_perdoc {}
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher fetcher_counter {robots.fetched=0, bytes_fetched=0, fetched=0}
27695 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher activethreads 0
27696 [Thread-39] INFO o.a.s.m.LoggingMetricsConsumer - 1522928780 172.18.25.22:1024 3:fetcher fetcher_average_persec {bytes_fetched_perSec=0.0, fetched_perSec=0.0}
No configuration files were altered, including crawler-conf.yaml.
The flag parser.emitOutlinks should also be true, since that is the default in crawler-default.yaml.
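For reference, overriding it in crawler-conf.yaml would look something like this (I have not done so, since it already defaults to true in crawler-default.yaml; the exact nesting under a top-level config key is how my generated file is laid out and may differ in other versions):

# crawler-conf.yaml - only needed if you want to override crawler-default.yaml
config:
  parser.emitOutlinks: true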
In another project I also followed the YouTube tutorial on Elasticsearch; there as well, no pages were fetched or indexed at all.
Where could the mistake be that keeps the crawler from fetching any pages?
The topology generated by the artefact is merely an example and uses StdOutStatusUpdater, which simply dumps discovered URLs to the console. If you are running in local mode or with a single worker, you could use MemoryStatusUpdater instead, as it adds discovered URLs to the MemorySpout and these then get processed in turn.
Note that this won't persist any information about the URLs when you terminate the topology or if it crashes. Again, this is just for debugging and as an initial step with StormCrawler.
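Concretely, in the CrawlTopology class generated for you, the change would look roughly like the sketch below (class and package names are taken from StormCrawler 1.x core and the exact wiring in your generated topology may differ; treat it as an illustration, not the exact code):

import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.persistence.MemoryStatusUpdater;

// Before: discovered URLs were only printed to the console
// builder.setBolt("status", new StdOutStatusUpdater())
//        .localOrShuffleGrouping("parse", Constants.StatusStreamName);

// After: discovered URLs are fed back to the MemorySpout and fetched in turn
builder.setBolt("status", new MemoryStatusUpdater())
       .localOrShuffleGrouping("parse", Constants.StatusStreamName);

With that change, the queue_size and fetcher counters in the metrics output should keep growing instead of dropping to zero after the first page.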
If you want the URLs to be persisted, you could use any of the persistence backends (Solr, Elasticsearch, SQL). Feel free to describe your issue with ES as a separate question.