regex · web-crawler · stormcrawler

Applying a regex filter to the crawler to crawl specific pages


I am using StormCrawler 1.10 and Elasticsearch 6.3.x. For example, I have a main website https://www.abce.org with subpages such as https://abce.org/def and https://abce.org/ghi. I want to crawl only the pages under https://www.abce.org/ghi.

My seed URL is https://www.abce.org/ghi/.

So far I have applied each of the following regex filters, one at a time:

  1. +^https:\/\/www.abce.org\/ghi*
  2. +^(?:https?:\/\/)www.abce.org\/ghi(.+)*$
  3. +^(?:https?:\/\/)?(?:www\.)?abce\.[a-zA-Z0-9.\S]+$
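For what it's worth, each of those patterns has a likely problem: the unescaped dots match any character, the trailing ghi* in pattern 1 means "gh followed by zero or more i" rather than the /ghi path, patterns 1 and 2 require the www. host even though the site also links to https://abce.org/..., and pattern 3 accepts any non-whitespace path under the domain, not just /ghi. In the default-regex-filters.txt syntax read by StormCrawler's RegexURLFilter, a pattern that accepts both host variants (a suggested line, not one from the original post) might look like:

    # accept any page under /ghi on either host variant
    +^https?:\/\/(www\.)?abce\.org\/ghi
    # reject everything else
    -.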

I tested my regular expressions on regexr and they show as valid, but when I check the status index it contains only the seed URL with status DISCOVERED and nothing else.
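
For reference, one quick way to see what is actually being recorded is to query the status index directly. The index name status and the uppercase status values are StormCrawler defaults, so adjust them if your setup differs:

    # list URLs currently recorded as DISCOVERED (Elasticsearch 6.x)
    curl "http://localhost:9200/status/_search?pretty&q=status:DISCOVERED"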


Solution

  • Try the FastURLFilter, which you might find more intuitive to use (a sketch of a possible configuration follows below). Run the topology in debug mode to check that URLs do get submitted to the URLFilters and that they behave as you expect.

    Before you ask, here's a tip on debugging Storm.
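
    As a concrete starting point, here is a sketch of what a FastURLFilter rules file could look like for this case. The scope and rule keywords (HOST, GLOBAL, AllowPath, DenyPath) follow the FastURLFilter rules format; the file name and the patterns are assumptions for this particular site rather than something given in the answer:

        # fast.urlfilter.txt (hypothetical name): scopes are matched per URL
        # and the first matching rule within a scope decides
        HOST www.abce.org
        AllowPath ^/ghi
        DenyPath .+

        HOST abce.org
        AllowPath ^/ghi
        DenyPath .+

        # reject URLs on any other host
        GLOBAL
        DenyPath .+

    The filter then needs to be declared in urlfilters.json in place of the regex filter. The class name below is the StormCrawler 1.x one; the params key pointing at the rules file is an assumption, so check it against your version's documentation:

        {
          "com.digitalpebble.stormcrawler.filtering.URLFilter": [
            {
              "class": "com.digitalpebble.stormcrawler.filtering.regex.FastURLFilter",
              "name": "FastURLFilter",
              "params": {
                "file": "fast.urlfilter.txt"
              }
            }
          ]
        }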