If I need to crawl just ONE host in a domain while still crawling the rest of our sites, what's the regex to put in the default-regex-filters.txt to accomplish that?
I am trying to block all hosts at https://*.bar.com while allowing ONLY https://foo.bar.com.
Can I use a generalized rule
-^https?://.*\.bar\.com.*
followed by a specific rule allowing the one host?
+^https?://foo\.bar\.com.*
Will that work?
I tried a complicated rule along the lines of -^https?://([a-eg-zA-EG-Z0-9]
to block everything but foo,
but it seems much simpler to just negate everything and add back the one I actually want...
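The most specific rules should come first, because the first rule that matches a URL decides its fate. So the + rule for foo.bar.com has to go before the general - rule for *.bar.com, not after it; see code.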
The FastURLFilter (https://github.com/DigitalPebble/storm-crawler/wiki/URLFilters) follows the same logic but could be simpler to organize.
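
To see why the order matters, here is a rough sketch in Python (not the crawler's actual Java code) that imitates the way the regex filter is assumed to work: rules are read top to bottom, the first one that matches decides, and a URL that matches nothing is rejected. The names parse_rules and accepts are made up for this illustration, and the trailing "+." catch-all is an assumption about what the rest of your default-regex-filters.txt contains.

```python
import re

# Rules in file order: the specific allow rule precedes the general deny rule.
# The final "+." catch-all is assumed here so that all other sites stay crawlable.
RULES_TEXT = r"""
+^https?://foo\.bar\.com.*
-^https?://.*\.bar\.com.*
# accept anything else
+.
"""


def parse_rules(text):
    """Turn '+regex' / '-regex' lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected (assumed)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


RULES = parse_rules(RULES_TEXT)

# Same rules with the general deny rule moved to the top, as in the question.
REVERSED = [RULES[1], RULES[0], RULES[2]]

if __name__ == "__main__":
    for url in ("https://foo.bar.com/page",
                "https://www.bar.com/page",
                "https://example.org/"):
        print(url, accepts(RULES, url), accepts(REVERSED, url))
```

With the allow rule first, foo.bar.com is kept and every other *.bar.com host is dropped. With the order from the question, the general - rule also matches foo.bar.com and rejects it before the + rule is ever consulted.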