elasticsearch, web-crawler, nutch, stormcrawler

StormCrawler crawl and indexing


I've worked with Nutch 1.x for crawling websites, using Elasticsearch to index the data. I recently came across StormCrawler and like it, especially its streaming nature.

Do I have to initialize the index and create the mappings on the ES server that StormCrawler sends the data to?

With Nutch, as long as the ES index was up and running, the mapping took care of itself, apart from some fine-tuning. Is it the same for StormCrawler, or do I have to create the index and mapping beforehand?


Solution

  • Great to hear you like StormCrawler.

    As explained in the README and the video tutorial (which is based on ES 2.x), you should use the ES_IndexInit script to set the mapping explicitly. It will probably work without it, but the result would not be optimal.
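
To illustrate what "setting the mapping explicitly" means, here is a minimal sketch that builds (but does not send) the kind of index-creation request an init script would issue. It is not the ES_IndexInit script itself: the index name `content`, the field names, and the helper `build_index_request` are hypothetical, and the body uses the typeless mapping layout of Elasticsearch 7 and later. On ES 2.x the mapping would need a document-type level and different field types.

```python
import json
import urllib.request


def build_index_request(es_url, index, properties):
    """Build a PUT request that creates an index with an explicit mapping.

    The request is only constructed here, not sent.
    """
    # Typeless mapping body (assumes Elasticsearch 7+).
    body = {"mappings": {"properties": properties}}
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical fields, for illustration only.
properties = {
    "url": {"type": "keyword"},   # exact-match lookups, no analysis
    "title": {"type": "text"},    # analyzed full-text field
    "content": {"type": "text"},
}

req = build_index_request("http://localhost:9200", "content", properties)

# To actually create the index against a running server:
# urllib.request.urlopen(req)
```

Declaring the field types up front, instead of letting Elasticsearch guess them from the first documents it receives, is the general reason an explicit mapping tends to give better results than dynamic mapping.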