
Stormcrawler not indexing content with Elasticsearch


Stormcrawler is indexing documents into Elasticsearch, but the content field is missing.

Stormcrawler is up-to-date with 'origin/master' https://github.com/DigitalPebble/storm-crawler.git

Using elasticsearch-5.6.4

crawler-conf.yaml has

indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"

The url and title fields are indexed, but not content.

I have been trying to get this working by following Julien's tutorial at: https://www.youtube.com/watch?v=xMCuWpPh-4A

Everything is working, except that the content is not being indexed into Elasticsearch. I feel like this is some small config error, but I have tried many variations with no luck, so now I am asking for help.

Thanks.


Solution

  • Are you sure that the content is not indexed? The content field is not stored (see ES_IndexInit.sh), but it should still be indexed, which means it is searchable even though it is not returned with the results. To store it, modify the init script and re-run the crawl; you would then get the content back the same way as the other fields. To test that it is indexed, try querying on the content field and see how it affects the results.
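
As a rough illustration of the "try querying on it" check, here is a minimal Python sketch that sends a match query against the content field and counts the hits. The host (http://localhost:9200), the index name ("index"), and the search term are assumptions, not values taken from the question; adjust them to whatever your ES_IndexInit.sh and es-conf.yaml actually use. The helper names are made up for this example.

```python
import json
import urllib.request


def build_content_query(term, field="content", size=5):
    """Build a match query on the given field (hypothetical helper)."""
    return {"query": {"match": {field: term}}, "size": size}


def count_hits(response):
    """Read the hit count from a search response.

    Elasticsearch 5.x reports hits.total as a plain integer; newer
    versions wrap it in an object with a "value" key, so handle both.
    """
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def search(base_url, index, body):
    """POST the query to the _search endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        url="{}/{}/_search".format(base_url.rstrip("/"), index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read().decode("utf-8"))


if __name__ == "__main__":
    # Assumed values: a local node and an index called "index".
    # Pick a word you know appears in the body of a crawled page.
    result = search("http://localhost:9200", "index",
                    build_content_query("crawler"))
    print("documents matching on content:", count_hits(result))
```

If the count is greater than zero for a word that only occurs in the page body (not in the title or URL), the content is indexed and merely not stored.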