
StormCrawler settings


I have a few questions about StormCrawler: http://stormcrawler.net/

1. Deny the crawling of Shops:

I tried using this regex in default-regex-filters.txt: -(shop). Is this the right way to do it? StormCrawler still crawls websites which have "shop" somewhere in their URL.

2. What does the "maxDepth" Parameter do?

I need to be able to limit the crawl depth per website, e.g. only crawl the pages which are one click/level away from the /home page. Is this the right parameter for that use case? If not, where is the option for this?

3. Elasticsearch: Discovered & Fetched

I would expect discovered to always be bigger than fetched, but I have cases where fetched > discovered. Is there an explanation for this, and what exactly do discovered and fetched mean?

4. Configuration entry: parse.emitOutlinks

I don't really understand what it means. Is there a simple explanation for it? When I set it to false, the crawler only crawled the first page of a URL, and I don't know why.

5. Difference between "fetcherthreads" and "threads per Queue"?

We currently use 200 fetcher threads and 20 threads per queue. How do these two relate to each other?

Sorry for the number of questions, but I would really appreciate your help. Thank you in advance!

Regards,

Jojo


Solution

    1. Deny the crawling of Shops

    -.*(shop) should work. The expression you tried does not allow for any characters before "shop".
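
    For example, a default-regex-filters.txt along these lines would reject any URL containing "shop". This is only a sketch, assuming the Nutch-style behaviour where the first matching rule decides and the trailing +. catch-all (the stock default) accepts everything else:

        # reject any URL containing "shop"
        -.*(shop)
        # accept anything else
        +.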

    2. What does the "maxDepth" Parameter do?

    Yes, this is exactly what it does. It tracks the depth from the seed URLs and filters out anything beyond the threshold you set.
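
    In the StormCrawler versions I have used, this threshold is set on the MaxDepthFilter entry in urlfilters.json; treat the exact file and class names below as assumptions based on the archetype defaults. Something like this should limit the crawl to the seeds plus the pages they link to directly:

        {
          "class": "com.digitalpebble.stormcrawler.filtering.depth.MaxDepthFilter",
          "name": "MaxDepthFilter",
          "params": {
            "maxDepth": 1
          }
        }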

    3. Elasticsearch: Discovered & Fetched

    See the FAQ entry "Why do I have different document counts in status and index?"

    Why not have a look at the tutorials and WIKI?

    4. Configuration entry: parse.emitOutlinks

    As the name suggests, this parameter controls whether the parser bolt adds outlinks to the status stream. Setting it to false is useful when you don't want to expand the crawl and only want to fetch the seeds.
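
    So the behaviour you saw is expected: with the setting below in the crawler configuration (typically crawler-conf.yaml in the archetype, depending on your setup), no outlinks are emitted and only the seed URLs get fetched.

        # do not emit outlinks from the parser bolt:
        # the crawl stays limited to the seed URLs
        parse.emitOutlinks: false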

    5. Difference between "fetcherthreads" and "threads per Queue"?

    Fetcher threads are simply the number of threads used within a FetcherBolt to fetch URLs. The FetcherBolt places incoming URLs into internal queues based on their hostname (or domain or IP), and the fetcher threads poll from these queues. By default, StormCrawler allows only one fetcher thread per internal queue so that the crawl is polite and does not send requests to the target hosts too frequently.
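
    In the crawler configuration these correspond to keys along the following lines (the values are the ones from your setup; the key names are how I remember them from the default configuration, so double-check against your version):

        fetcher.threads.number: 200     # total threads in the FetcherBolt
        fetcher.threads.per.queue: 20   # threads allowed to poll the same internal queue
        fetcher.queue.mode: "byHost"    # queues keyed by host (or byDomain / byIP)

    Note that with 20 threads per queue, up to 20 requests can hit the same host at once, which trades politeness for throughput.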

    If you haven't done so already, I'd recommend that you look at the video tutorials on YouTube.