web-crawler, stormcrawler

Disable crawling of subdomains in a StormCrawler flow


How can we stop subdomains from being injected when streaming? Right now, if we inject www.ebay.com into the stream, the output also contains pages from subdomains: my.ebay.com, community.ebay.com, ...


Solution

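  • You can configure HostURLFilter to exclude URLs that fall outside the seeds' hostnames by setting ignoreOutsideHost to true in urlfilters.json. ignoreOutsideHost is the stricter of the two settings shown here, so it is the one that keeps subdomains such as my.ebay.com out: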

    {
      "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
      "name": "HostURLFilter",
      "params": {
        "ignoreOutsideHost": true,
        "ignoreOutsideDomain": true
      }
    }
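
To make the difference between the two flags concrete, here is a minimal standalone Java sketch of the kind of comparison involved. It is not StormCrawler's implementation: the class HostFilterSketch and the methods host, naiveDomain and keep are hypothetical names invented for this illustration, and the "domain" is approximated as the last two labels of the hostname, which works for ebay.com but not for suffixes such as co.uk. The real filter may derive the domain differently and may treat URLs without a host in another way.

```java
import java.net.URI;

// Illustration only: a hypothetical re-creation of host/domain filtering,
// not the code of com.digitalpebble.stormcrawler.filtering.host.HostURLFilter.
public class HostFilterSketch {

    // Lowercased hostname of a URL, or null if the URL has no host part.
    // URI.create throws IllegalArgumentException on malformed input.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? null : h.toLowerCase();
    }

    // ASSUMPTION: the "domain" is just the last two labels of the hostname.
    // This is a simplification and is wrong for suffixes like co.uk.
    static String naiveDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    // Returns true if the outlink should be kept, given the two flags.
    static boolean keep(String sourceUrl, String outlink,
                        boolean ignoreOutsideHost, boolean ignoreOutsideDomain) {
        String from = host(sourceUrl);
        String to = host(outlink);
        if (from == null || to == null) {
            return false; // choice made for this sketch: drop URLs without a host
        }
        if (ignoreOutsideHost && !from.equals(to)) {
            return false; // different hostname, e.g. my.ebay.com vs www.ebay.com
        }
        if (ignoreOutsideDomain && !naiveDomain(from).equals(naiveDomain(to))) {
            return false; // different domain, e.g. example.com vs ebay.com
        }
        return true;
    }

    public static void main(String[] args) {
        String seed = "https://www.ebay.com/";
        System.out.println(keep(seed, "https://www.ebay.com/deals", true, true));  // true
        System.out.println(keep(seed, "https://my.ebay.com/", true, true));        // false
        System.out.println(keep(seed, "https://my.ebay.com/", false, true));       // true
        System.out.println(keep(seed, "https://www.example.com/", false, true));   // false
    }
}
```

Under these assumptions, setting only ignoreOutsideDomain would still let my.ebay.com through, which matches the behaviour described in the question; it is the host check that removes the subdomains.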