StormCrawler's ContentParseFilter


If I set StormCrawler's ContentParseFilter to be

"pattern": "//DIV[@id=\"site-body\"]",

does that mean it is the ONLY place StormCrawler will look for links to other pages when processing each URL? I am wondering whether setting it will make the crawler start ignoring all the URLs in the menus and such.

Thanks! Jim


Solution

  • See the wiki page on ParseFilters:

    The ContentFilter restricts the text of a document to the text covered by an XPath expression.

    It does not affect the extraction of links at all; its aim is to improve the text that gets indexed. Links in the menus and elsewhere on the page are still found.
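
    For reference, a ContentFilter entry in parsefilters.json might look like the sketch below. It assumes the com.digitalpebble package names used by pre-Apache StormCrawler releases (newer releases may use a different package prefix), and the "name" value is just an arbitrary label; only the "pattern" line comes from the question.

```json
{
  "com.digitalpebble.stormcrawler.parse.ParseFilters": [
    {
      "class": "com.digitalpebble.stormcrawler.parse.filter.ContentFilter",
      "name": "ContentFilter",
      "params": {
        "pattern": "//DIV[@id=\"site-body\"]"
      }
    }
  ]
}
```

    With an entry like this, only the text under the matching DIV should be kept as the document's text, while outlinks are still collected from the whole page.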