regex stormcrawler

Stormcrawler and regex when parsing rules in the default-regex-filters.txt?


If I need to crawl just ONE host in a domain while still crawling the rest of our sites, what's the regex to put in the default-regex-filters.txt to accomplish that?

I am trying to block all hosts at https://*.bar.com while allowing ONLY https://foo.bar.com

Can I put a general rule -^https?://.*\.bar\.com.* first, followed by a specific rule +^https?://foo\.bar\.com.* that allows the one host? Will that work?

I tried a complicated -^https?://([a-eg-zA-EG-Z0-9] type of thing to block everything but foo, but it seems much simpler to negate everything and add back the one host I actually want...


Solution

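  • The most specific rules should come first. The filter applies the first rule that matches a URL and ignores the ones after it, so the order in the question would reject foo.bar.com before the allow rule is ever reached. Put +^https?://foo\.bar\.com.* above -^https?://.*\.bar\.com.* instead. See the code: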

    https://github.com/DigitalPebble/storm-crawler/blob/399cdac2125c39ef9be26586a2ca2609f92b0988/core/src/main/java/com/digitalpebble/stormcrawler/filtering/regex/RegexURLFilterBase.java#L156

    The FastURLFilter (https://github.com/DigitalPebble/storm-crawler/wiki/URLFilters) follows the same logic but could be simpler to organize.
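
    To make the ordering concrete, here is a minimal sketch of first-match-wins evaluation. It is not StormCrawler's actual class: FirstMatchFilterSketch and its accepts method are hypothetical names, and two things are assumptions on my part, that a rule matches when the pattern is found anywhere in the URL, and that a URL matching no rule is rejected. The trailing +. rule is there to keep accepting every URL outside bar.com.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a first-match-wins regex URL filter; not StormCrawler code. */
public class FirstMatchFilterSketch {

    /** One parsed rule: the sign (+ accept, - reject) and the compiled pattern. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Each line has the same shape as a line of the filter file: sign, then regex. */
    public FirstMatchFilterSketch(String... lines) {
        for (String line : lines) {
            boolean accept = line.charAt(0) == '+';
            rules.add(new Rule(accept, Pattern.compile(line.substring(1))));
        }
    }

    /** Returns the verdict of the first rule that matches; later rules are never consulted. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            // Assumption: a rule matches if the pattern is found anywhere in the URL.
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        // Assumption: a URL that matches no rule is rejected.
        return false;
    }

    public static void main(String[] args) {
        // Specific allow rule first, then the general block, then accept everything else.
        FirstMatchFilterSketch specificFirst = new FirstMatchFilterSketch(
                "+^https?://foo\\.bar\\.com.*",
                "-^https?://.*\\.bar\\.com.*",
                "+.");

        // The order proposed in the question: general block first.
        FirstMatchFilterSketch generalFirst = new FirstMatchFilterSketch(
                "-^https?://.*\\.bar\\.com.*",
                "+^https?://foo\\.bar\\.com.*",
                "+.");

        System.out.println(specificFirst.accepts("https://foo.bar.com/page")); // true
        System.out.println(specificFirst.accepts("https://www.bar.com/"));     // false
        System.out.println(specificFirst.accepts("https://example.org/"));     // true
        System.out.println(generalFirst.accepts("https://foo.bar.com/page"));  // false
    }
}
```

    The last line shows why the order in the question does not do what was intended: the general block rule also matches foo.bar.com, so the allow rule below it never gets a chance.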