
Lucene search performance - DOC vs DEFAULT sorting


We ran some search and insert tests with Lucene.Net 3.0.3.

For the testing we used a keyword analyzer and text generator based on real English words.

When the index reaches around 8 million documents, a search for 1000 random sentences takes 25 minutes to complete (default sorting).

If we change the search to document sorting:

searcher.Search(query, null, int.MaxValue, new Sort(new SortField(null, SortField.DOC, true))); // boolean query

The search only takes a few seconds to complete.

What gives? Is the default sorting based on relevance? Why does it have such a huge impact?

Also, if we reduce the number of hits from int.MaxValue to, let's say, 50, the search also drops to a few seconds.

Is it only taking the first 50 hits it finds in the index and disregarding the rest?


Solution

  • I believe your guess is right. If you sort in doc id order, Lucene doesn't need to score every matching document; it can short-circuit once it has found enough matching documents. With score sorting, every matching document has to be scored to know which are the best matches.

    Seems like the question you should be asking, though, is why is it taking so long to search anyway?

    Based on what you've written, I think I can guess: you should not be using KeywordAnalyzer for full text. It sounds like you are indexing full text as keywords and then searching for sentences, probably with double wildcards, regexes, or something similar. Stop doing that. You might as well forget Lucene and code up a good old-fashioned sequential search, since that is what you are forcing Lucene to do anyway. Use an analyzer that actually serves your search needs (StandardAnalyzer or EnglishAnalyzer are good starting points), and use phrase queries to search for phrases or sentences.
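To make that last suggestion concrete, here is a minimal sketch against Lucene.Net 3.0.3. It assumes an in-memory index, and it assumes a field called "body"; that name, the class name and the method name are made up for illustration and are not from the question. It indexes text with StandardAnalyzer, runs a quoted phrase query through QueryParser, and asks only for the top maxHits results rather than int.MaxValue.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

public static class PhraseSearchSketch
{
    // Hypothetical field name, chosen for this example only.
    private const string BodyField = "body";

    // Indexes the given texts in memory and returns how many of the
    // top maxHits results match the phrase.
    public static int CountPhraseHits(string[] texts, string phrase, int maxHits)
    {
        // StandardAnalyzer splits text into lowercased tokens, so a sentence
        // becomes a sequence of terms rather than one giant keyword.
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);

        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, analyzer,
                       IndexWriter.MaxFieldLength.UNLIMITED))
            {
                foreach (var text in texts)
                {
                    var doc = new Document();
                    doc.Add(new Field(BodyField, text,
                        Field.Store.YES, Field.Index.ANALYZED));
                    writer.AddDocument(doc);
                }
            }

            using (var searcher = new IndexSearcher(directory, true))
            {
                // Wrapping the escaped text in double quotes makes the parser
                // build a phrase query from the analyzed terms.
                var parser = new QueryParser(Version.LUCENE_30, BodyField, analyzer);
                Query query = parser.Parse("\"" + QueryParser.Escape(phrase) + "\"");

                // Request a bounded number of hits instead of int.MaxValue.
                TopDocs topDocs = searcher.Search(query, maxHits);
                return topDocs.ScoreDocs.Length;
            }
        }
    }
}
```

Calling PhraseSearchSketch.CountPhraseHits(texts, "quick brown fox", 50) should count only documents that contain those terms next to each other and in that order. On a real index you would keep the IndexWriter and IndexSearcher around instead of rebuilding them for every query.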