Tags: web-crawler, stormcrawler

Clarification on how StormCrawler's default-regex-filters.txt works


With StormCrawler, if I add -^(http|https):\/\/example.com\/page\/?date to default-regex-filters.txt, I still see

2019-03-20 08:49:58.110 c.d.s.b.JSoupParserBolt Thread-5-parse-executor[7 7] [INFO] Parsing : starting https://example.com/page/?date=1999-9-16&t=list
2019-03-20 08:49:58.117 c.d.s.b.JSoupParserBolt Thread-5-parse-executor[7 7] [INFO] Parsed https://example.com/page/?date=1999-9-16&t=list in 6 msec

in the logs, yet no documents show up in the index. Is StormCrawler avoiding the URL, is it still fetching it, or is it just retrieving the URL from the status table and then evaluating it against the filters?


Solution

  • The filtering is applied to outlinks after parsing; the 'surviving' URLs are sent to the status updater bolt. In other words, the filters affect only the discovery of new URLs: if a URL is emitted by a spout (e.g. read back from the status table), it will still be fetched and parsed.
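The post-parse filtering described above can be sketched as plain Java. This is not StormCrawler's actual RegexURLFilter implementation, just a minimal illustration of how a `-`-prefixed (deny) line from default-regex-filters.txt drops matching outlinks while the rest survive; note that in this sketch the `?` is escaped (`\?`) so it matches the literal query-string delimiter, which is an assumption about the pattern's intent.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OutlinkFilterSketch {

    // Drop any outlink matched by the deny pattern; keep the rest.
    // In StormCrawler, the survivors would go to the status updater bolt.
    static List<String> filterOutlinks(List<String> outlinks, Pattern deny) {
        return outlinks.stream()
                .filter(u -> !deny.matcher(u).find())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Deny pattern from the question, with the leading '-' stripped and
        // the '?' escaped to match a literal question mark (an assumption).
        Pattern deny = Pattern.compile("^(http|https)://example\\.com/page/\\?date");

        List<String> outlinks = Arrays.asList(
                "https://example.com/page/?date=1999-9-16&t=list",
                "https://example.com/about");

        // Only the /about link survives the filter.
        System.out.println(filterOutlinks(outlinks, deny));
    }
}
```

A URL handed to the topology by a spout bypasses this step entirely, which is why the question's URL still appears in the JSoupParserBolt logs even though the filter line is correct.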