Lucene ranking with BooleanQuery - determining quality of hits

I am using a BooleanQuery built from TermQuery clauses, all on the same field, and all currently set to SHOULD.

I have tried to figure out how the ranking of the ScoreDoc[] result works for this query, but haven't been able to find the right documentation. Maybe you can help with the following questions:

1) Will a BooleanQuery rank hits that match all terms higher than hits that match only a single term?

2) Is there a way to determine which TermQuery matched and which did not for a given ScoreDoc in the results?

Thanks for the help!

Solution

A boolean query does rank hits that match multiple query terms more highly than those that match only one, but keep in mind that this is only one part of the scoring algorithm. Several other factors can outweigh it.

Query terms combined by a boolean query have the sub-scores of their matching clauses added together to form the final score, so each additional matching term naturally raises it. On top of that, there is a coord factor, which is larger when a larger proportion of the query terms match, and which is multiplied into the score.

However, multiple matches of the same query term, document length, term rarity, and boosts also affect the score, and it's quite possible for a document that doesn't match all terms to get a higher score because of these factors.
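To make that concrete, here is a toy model in plain Java. It is not Lucene code and not Lucene's actual implementation: it only assumes the commonly documented classic defaults (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(docLength)), sums the per-term scores, and multiplies by coord. Query norm, boosts, and norm encoding are ignored, and the class name, method names, and all numbers are made up for illustration.

```java
public class ToyBooleanScore {

    // Simplified weight of one matching term: tf * idf^2 * lengthNorm.
    // Assumption: classic TF-IDF defaults, with no boosts and no query norm.
    static double termScore(int freq, int docFreq, int numDocs, int docLength) {
        if (freq == 0) {
            return 0.0; // a term that does not occur contributes nothing
        }
        double tf = Math.sqrt(freq);
        double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        double lengthNorm = 1.0 / Math.sqrt(docLength);
        return tf * idf * idf * lengthNorm;
    }

    // freqs[i]    = occurrences of query term i in the document
    // docFreqs[i] = number of documents in the index containing query term i
    static double score(int[] freqs, int[] docFreqs, int numDocs, int docLength) {
        double sum = 0.0;
        int matched = 0;
        for (int i = 0; i < freqs.length; i++) {
            if (freqs[i] > 0) {
                matched++;
                sum += termScore(freqs[i], docFreqs[i], numDocs, docLength);
            }
        }
        // coord: fraction of the query terms that matched this document
        double coord = (double) matched / freqs.length;
        return coord * sum;
    }

    public static void main(String[] args) {
        int numDocs = 1000;
        int[] docFreqs = {500, 5}; // term 0 is common, term 1 is rare

        // Document A: matches both terms once each, but is long (400 terms).
        double a = score(new int[] {1, 1}, docFreqs, numDocs, 400);
        // Document B: matches only the rare term, four times, and is short (16 terms).
        double b = score(new int[] {0, 4}, docFreqs, numDocs, 16);

        System.out.println("A (matches both terms): " + a);
        System.out.println("B (matches one term):   " + b);
        System.out.println("B outranks A: " + (b > a));
    }
}
```

In this sketch, B outranks A even though A matches every query term, because term frequency, rarity, and the shorter length more than make up for B's lower coord. With length and frequencies held equal, the document matching both terms wins.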

See the TFIDFSimilarity docs for details on the algorithm in use here.

To understand the scoring of a document for your query, you should get familiar with Explanation. You can get a human-readable explanation of why a document was scored the way it was like this:

// resultDocNo is the doc id taken from a ScoreDoc (scoreDoc.doc)
Explanation explain = searcher.explain(myQuery, resultDocNo);
System.out.print(explain.toString());

To identify the fragments of the documents which matched the query, you can use Highlighter, a simple use of which might be:

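QueryScorer scorer = new QueryScorer(myQuery);
Highlighter highlighter = new Highlighter(scorer);
// getBestFragment takes the stored field text as a String, hence myDoc.get(fieldName);
// it returns null if no part of the text matches the query
String fragment = highlighter.getBestFragment(analyzer, fieldName, myDoc.get(fieldName));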