I'm currently crawling 28 sites (some small, some large) and the crawls are generating about 25 MB of data. I'm indexing with Elasticsearch and using an edge_n-gram strategy for autocomplete. After some testing, it seems I need more data to create better multi-word (phrase) suggestions. I know I can simply crawl more sites, but is there a way to make Nutch crawl each site completely, or as completely as possible, to create more data for better search suggestions via edge_n_grams?
OR
Is this a lost cause? That is, no matter how much data I have, is the best way to create better multi-word suggestions simply to log users' search queries?
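For reference, the kind of edge_n-gram setup I mean is roughly the following (a simplified sketch, not my exact configuration; the index, analyzer, and field names are placeholders, and the exact request syntax depends on the Elasticsearch version):

    # Create an index whose "content" field is analyzed with an edge_ngram filter,
    # so partial input like "elast" matches tokens such as "elasticsearch".
    curl -XPUT 'http://localhost:9200/pages' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "analysis": {
          "filter": {
            "autocomplete_filter": { "type": "edge_ngram", "min_gram": 2, "max_gram": 20 }
          },
          "analyzer": {
            "autocomplete": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": ["lowercase", "autocomplete_filter"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "standard"
          }
        }
      }
    }'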
You could always increase the number of links that you want to crawl. If you're using the bin/crawl command, you can just increase the number of iterations, or modify the script and increase the sizeFetchlist parameter (https://github.com/apache/nutch/blob/master/src/bin/crawl#L117). This parameter is just used as the topN argument in the conventional bin/nutch script.
Keep in mind that these options are also available on the 2.x branch.
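For example, with the 1.x bin/crawl script something along these lines would make each crawl go deeper (a sketch only; the exact flags and argument order differ between Nutch versions, so check the usage message of your own bin/crawl first):

    # More rounds means more generate/fetch/parse/update cycles, so Nutch follows
    # links deeper into each site (the last argument is the number of rounds).
    bin/crawl -i urls/ crawl/ 10

    # Alternatively, edit the script and raise the size of each generated fetch list;
    # in the linked script it is computed roughly like this and later passed to
    # "bin/nutch generate ... -topN $sizeFetchlist":
    #   sizeFetchlist=`expr $numSlaves \* 50000`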
What kind of suggestions are you trying to accomplish? In an app I developed some time ago, we used a combination of both approaches (we were using Solr instead of Elasticsearch, but the essence is the same): we indexed the user queries in a separate collection/index, and on it we configured an EdgeNGramFilterFactory (Solr's equivalent of Elasticsearch's edge_n_grams). This provided some basic query suggestions based on what users had already searched for. When no suggestions could be found with this approach, we fell back to suggesting single terms based on the crawled content, which required some JavaScript tweaking in the frontend.
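In Elasticsearch terms, that first part would look roughly like the sketch below (illustrative only: the query_log index name, the query field, and the edge_ngram analyzer on it are assumptions, not the exact setup we had, and the request syntax depends on your ES version):

    # Assumes a separate "query_log" index whose "query" field uses an edge_ngram
    # analyzer at index time (similar to the analyzer sketched in the question).

    # 1. Log each search a user actually runs.
    curl -XPOST 'http://localhost:9200/query_log/_doc' -H 'Content-Type: application/json' \
      -d '{ "query": "nutch crawl depth" }'

    # 2. As the user types, match the partial input against the logged queries
    #    to get multi-word (phrase) suggestions.
    curl -XGET 'http://localhost:9200/query_log/_search' -H 'Content-Type: application/json' \
      -d '{ "query": { "match": { "query": "nutch cr" } }, "size": 5 }'

When that search returns no hits, you can fall back to single-term suggestions against the main content index, which is what the JavaScript on our frontend handled.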
I'm not sure that using edge_n_grams on the whole textual content of a web page would be that helpful, mainly because n-grams would be created for the entire content and, given the huge number of matches, the suggestions wouldn't be very relevant. But I don't know your specific use case.