
Limit the crawl to subpages of the seed URL


I have this URL filter configured, which crawls pages based on the seed:


{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": false,
    "ignoreOutsideDomain": true
  }
}


But how can I limit the crawl to just subpages of the seed? For example, if I have the seed "https://www.test.com/", then with the above settings the crawler also crawls and adds URLs like "https://stg.test.com/" and its subpages, because they belong to the same domain.

How can I limit the crawl to "https://www.test.com/" and only subpages of that seed, such as "https://www.test.com/test1", "https://www.test.com/test2", etc.?

TIA.


Solution

  • Set ignoreOutsideHost to true in the config of the HostURLFilter. With that setting, the filter discards any discovered URL whose host differs from the host of the source URL, so "https://stg.test.com/" and its subpages will no longer be followed.
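
As a sketch, the adjusted filter entry (typically found in StormCrawler's urlfilters.json) would look like this, keeping the same class and filter name as in the question:

```json
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": true,
    "ignoreOutsideDomain": true
  }
}
```

Note that once ignoreOutsideHost is true, the value of ignoreOutsideDomain no longer matters: the host check is strictly more restrictive than the domain check, so only URLs on www.test.com pass the filter.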