elasticsearch · web-crawler · stormcrawler

Getting StormCrawler to retrieve more body content from a web page and put it into Elasticsearch


I have a Proof of Concept StormCrawler install pointing at one of our smaller university websites (https://example.com, around 300 pages), and am having issues with the amount of info SC is pulling from the body content. This site has a ton of menus at the top of the page, and SC is only getting most of the way through extracting the menu content before it cuts off and never actually gets to the real body content of the page. Is there a way to tell SC to grab a larger amount of body content from the page? Or is the issue on the Elasticsearch side? I currently have the SC/ES install set up just like the tutorial you have posted.

Thanks! Jim


Solution

  • This is probably due to the http.content.limit setting, which has a value of 65K in the configuration generated by the archetype.

    You can set it to -1 so that the entire content is preserved (a configuration sketch follows below).

    I noticed on a page of that site that the main content sits in a MAIN element. You could configure the ContentParseFilter so that it extracts the text from that element and uses it as the text of the document when found (see the parse filter sketch below). This way you won't be indexing the boilerplate text into ES.
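
    As a rough sketch of the first point, the limit lives in the crawler configuration file generated for your project (typically crawler-conf.yaml, though the exact file and surrounding keys depend on what the archetype produced for you):

        # crawler-conf.yaml (generated by the archetype)
        config:
          # default is 65536 bytes; -1 removes the cap so the whole page body
          # is fetched and parsed instead of being truncated mid-menu
          http.content.limit: -1

    After changing it, rebuild and resubmit the topology so the new value is picked up, then re-crawl the affected pages.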
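
    For the second point, here is a minimal sketch of a parsefilters.json entry, assuming the ContentFilter parse filter that ships with StormCrawler and its XPath "pattern" parameter; adjust the class name and pattern to match your version and the actual page markup:

        {
          "com.digitalpebble.stormcrawler.parse.ParseFilters": [
            {
              "class": "com.digitalpebble.stormcrawler.parse.filter.ContentFilter",
              "name": "ContentFilter",
              "params": {
                "pattern": "//MAIN"
              }
            }
          ]
        }

    When the MAIN element is present, its text is used as the document text that gets indexed into Elasticsearch; pages without it should simply fall back to the normal full-page extraction.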