I have been messing with this for a while now, and have not been able to sort out how the default-regex-filters.txt file for StormCrawler works.
In one example I need to limit the crawler to ONLY crawl items under https://www.example.com/dev and none of the other directories on that site. I put the rule
+.*\/dev\/.*
as the last line of default-regex-filters.txt, but it doesn't seem to work. I thought standard regex rules applied, but that doesn't seem to be the case: one of the examples above had / without the \ before it and it was working. I am rather confused by that, and wondering if there's a cheat sheet for the regex in that file so I can build these rules more easily.
As a follow-up, is it also true that only one + filter can be in the file? I vaguely remember reading that, but wanted to be sure.
You can have as many + filters in the file as you want.
The logic of the filtering is simply:
public String filter(URL pageUrl, Metadata sourceMetadata, String url) {
    // rules are tried in the order they appear in the file;
    // the first one whose pattern matches the URL wins
    for (RegexRule rule : rules) {
        if (rule.match(url)) {
            return rule.accept() ? url : null;
        }
    }
    // no rule matched -> the URL is filtered out
    return null;
}
where accept() indicates that the pattern starts with a +. If no rule matches, the URL is filtered out.
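The patterns are standard Java regular expressions, so you can test a rule outside the crawler before putting it in the file. A minimal sketch (the URL and class name are made up for illustration) showing that / needs no escaping in Java regex, and that the escaped form \/ happens to work as well:

import java.util.regex.Pattern;

public class RuleCheck {
    public static void main(String[] args) {
        String url = "https://www.example.com/dev/page.html";
        // the pattern is whatever follows the leading + or - in the rules file
        Pattern plain = Pattern.compile(".*/dev/.*");
        Pattern escaped = Pattern.compile(".*\\/dev\\/.*"); // \/ is a legal, if redundant, escape
        System.out.println(plain.matcher(url).matches());   // true
        System.out.println(escaped.matcher(url).matches()); // true
    }
}

Depending on the implementation the pattern may be applied with find() rather than matches(), but with the leading and trailing .* the rule behaves the same either way.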
Could it be that you left
# accept anything else
+.
above the expression you added?
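If so, that catch-all accepts every URL before your rule is ever reached, since the first matching rule wins. Moving your rule up and dropping the catch-all should do what you want; a minimal sketch of the whole file:

# accept only URLs whose path contains /dev/
+.*/dev/.*
# reject everything else; strictly redundant, since a URL
# that matches no rule is filtered out anyway
-.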
You might also want to have a look at the FastURLFilter, which is probably more intuitive.
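For reference, FastURLFilter groups rules by scope (domain, host, or global) instead of using one flat list of regexes. From memory, and with the caveat that the exact keywords below are an assumption on my part (check the StormCrawler documentation for the current syntax), restricting a crawl to /dev on one host would look roughly like:

# NOTE: syntax recalled from memory, verify against the docs
HOST www.example.com
AllowPath /dev/
DenyPath .+
GLOBAL
DenyPath .+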