I have the following filter configured, which restricts the crawl based on the seed:
    {
      "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
      "name": "HostURLFilter",
      "params": {
        "ignoreOutsideHost": false,
        "ignoreOutsideDomain": true
      }
    }
But how can I limit the crawl to just the subpages of the seed? For example, if my seed is "https://www.test.com/", then with the above settings the crawler also crawls and adds URLs like "https://stg.test.com/" and its subpages.
How can I limit the crawl to "https://www.test.com/" and only the subpages of this seed, such as "https://www.test.com/test1", "https://www.test.com/test2", etc.?
TIA.
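Simply set ignoreOutsideHost to true in the config of the HostURLFilter. With that setting, links whose host name differs from that of the page they were found on are discarded, so "https://stg.test.com/" is no longer followed from "https://www.test.com/".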
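
For reference, here is a sketch of what the filter entry from the question would look like with that one change applied. It assumes the entry sits in the usual URL filters file of the project (typically urlfilters.json, which is an assumption here since the question does not name the file) and leaves everything else as posted:

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": true,
    "ignoreOutsideDomain": true
  }
}
```

Since the host check is stricter than the domain check, ignoreOutsideDomain should no longer make a difference once ignoreOutsideHost is true, but keeping it is harmless.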