How can we stop subdomains from being crawled when we inject a URL into the stream?
Right now, if we inject www.ebay.com into the stream, the output also contains pages from its subdomains: my.ebay.com, community.ebay.com, ...
You can configure HostURLFilter to exclude URLs whose hostname differs from the seed's hostname by setting ignoreOutsideHost to true in urlfilters.json:
{
  "class": "com.digitalpebble.stormcrawler.filtering.host.HostURLFilter",
  "name": "HostURLFilter",
  "params": {
    "ignoreOutsideHost": true,
    "ignoreOutsideDomain": true
  }
}
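To illustrate why a host-level check drops my.ebay.com and community.ebay.com when the seed is www.ebay.com, here is a minimal sketch of an exact hostname comparison. It is not StormCrawler's actual implementation; the class HostCheck and the method sameHost are hypothetical names used only for this example, and the https:// scheme on the URLs is an assumption.

```java
import java.net.URI;

public class HostCheck {
    // Hypothetical helper: true only when both URLs have exactly the same hostname.
    public static boolean sameHost(String seedUrl, String outlinkUrl) {
        String seedHost = URI.create(seedUrl).getHost();
        String linkHost = URI.create(outlinkUrl).getHost();
        // equalsIgnoreCase returns false when linkHost is null
        return seedHost != null && seedHost.equalsIgnoreCase(linkHost);
    }

    public static void main(String[] args) {
        String seed = "https://www.ebay.com/";
        System.out.println(sameHost(seed, "https://www.ebay.com/deals"));  // same host, kept
        System.out.println(sameHost(seed, "https://my.ebay.com/"));        // subdomain, dropped
        System.out.println(sameHost(seed, "https://community.ebay.com/")); // subdomain, dropped
    }
}
```

A domain-level check would instead treat all three URLs as belonging to ebay.com, which is why the hostname comparison is the one that matters for excluding subdomains.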