web-crawler, stormcrawler

Parallel processing of newly inserted domains/URLs in StormCrawler using Elasticsearch


I am using StormCrawler for live crawling. I insert domains into Elasticsearch and the crawler crawls them fine. I have defined a limit on the number of URLs crawled for each domain (using Redis in SimpleFetcherBolt).

Scenario: when I insert a domain, StormCrawler starts crawling it. If I then enter a new domain in ElasticSeeds, StormCrawler does not fetch it immediately.

It is busy fetching pages of the previous domain. If the limit is high (say 1000 URLs), it takes at least 20 minutes to start crawling the newly inserted domain.

I want results instantly. Is there a priority one can set on a new domain, or a way to make StormCrawler start crawling a new domain as soon as it is inserted? A different queue (DB) for each domain?

Any suggestions would be appreciated.


Solution

  • I have defined a limit on the number of URLs crawled for each domain (using Redis in SimpleFetcherBolt)

    Could you please explain what you mean by that? You should not have to modify the fetcher bolt; that's what URL filters are for.

    What type of spout are you using? AggregationSpouts? How many instances of SimpleFetcherBolt are you using?

    SC should start crawling on a new domain pretty quickly. Please set the log level accordingly and check the logs to see whether the spouts have emitted tuples for the new domains and whether the URLs are blocked further down.

    EDIT: Either specify more than one instance of SimpleFetcherBolt or use FetcherBolt instead. With a single instance of SimpleFetcherBolt (SFB), the URLs of the new domain will be stuck in the queue behind those of the previous domain, whereas FetcherBolt will process them in parallel.
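To illustrate why queueing matters here, the following is a toy model only, not FetcherBolt's actual code: it compares the fetch order of one shared FIFO with that of one queue per host visited in turn. The class `QueueOrderDemo`, its methods, and the sample URLs are all hypothetical, and the model ignores threads and politeness delays.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; it does not use or reproduce any StormCrawler class.
class QueueOrderDemo {

    // Five URLs of a.com arrive first, then the first URL of the new domain b.com.
    // Each entry is {host, url}.
    static List<String[]> sampleArrivals() {
        List<String[]> arrivals = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            arrivals.add(new String[] {"a.com", "http://a.com/" + i});
        }
        arrivals.add(new String[] {"b.com", "http://b.com/1"});
        return arrivals;
    }

    // One shared FIFO: URLs are fetched strictly in arrival order.
    static List<String> singleFifo(List<String[]> arrivals) {
        List<String> order = new ArrayList<>();
        for (String[] a : arrivals) {
            order.add(a[1]);
        }
        return order;
    }

    // One queue per host; each pass takes at most one URL from every queue.
    static List<String> perHostRoundRobin(List<String[]> arrivals) {
        Map<String, Deque<String>> queues = new LinkedHashMap<>();
        for (String[] a : arrivals) {
            queues.computeIfAbsent(a[0], h -> new ArrayDeque<>()).add(a[1]);
        }
        List<String> order = new ArrayList<>();
        boolean progressed = true;
        while (progressed) {
            progressed = false;
            for (Deque<String> q : queues.values()) {
                String next = q.poll(); // null when this host's queue is empty
                if (next != null) {
                    order.add(next);
                    progressed = true;
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        List<String[]> arrivals = sampleArrivals();
        System.out.println("single FIFO:  " + singleFifo(arrivals));
        System.out.println("per-host RR:  " + perHostRoundRobin(arrivals));
    }
}
```

In this model the b.com URL is fetched last with the shared FIFO but second with per-host queues, which is the kind of head-of-line blocking described in the question.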

    By limit I meant that SC only fetches a limited number of URLs per domain and then stops fetching. Say the limit is 100: SC will fetch 100 URLs of each domain.

    Maybe do that as a separate URL filter; this will be a lot cleaner than hacking the fetcher class, and it should also be more efficient.
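As a rough sketch of the counting logic such a filter could hold, the class below caps the number of URLs accepted per host. It is deliberately standalone: `PerHostLimit` and its `filter(String)` method are hypothetical names, and it does not implement StormCrawler's actual `URLFilter` interface; it only borrows the convention of returning the URL to keep it and `null` to drop it. It assumes the limit is applied per host name rather than per registered domain, and that in-memory counters are acceptable; a crawl with several workers would need a shared store, such as the Redis instance mentioned in the question.

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of StormCrawler: caps accepted URLs per host.
class PerHostLimit {
    private final int maxUrlsPerHost;
    // In-memory counters (assumption): local to one JVM and lost on restart.
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    PerHostLimit(int maxUrlsPerHost) {
        this.maxUrlsPerHost = maxUrlsPerHost;
    }

    // Returns the URL if it is within the host's budget, or null to drop it.
    String filter(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // not a parseable URL
        }
        if (host == null) {
            return null; // no host component, e.g. a mailto: link
        }
        int seen = counts
                .computeIfAbsent(host.toLowerCase(), h -> new AtomicInteger())
                .incrementAndGet();
        return seen <= maxUrlsPerHost ? url : null;
    }

    public static void main(String[] args) {
        PerHostLimit limit = new PerHostLimit(2);
        System.out.println(limit.filter("http://example.com/a"));
        System.out.println(limit.filter("http://example.com/b"));
        System.out.println(limit.filter("http://example.com/c")); // over the limit, prints null
        System.out.println(limit.filter("http://other.org/"));    // different host, still accepted
    }
}
```

Because the check happens when outlinks are discovered rather than when they are fetched, URLs beyond the limit never enter the queue in the first place.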

    SC uses the AggregationSpout by default, right?

    No, see ESCrawlTopology