I am using Lucene 5.2.1, and I want to implement my own ranking rule. To write a correct rule, I need to know whether Lucene's default relevance score (the one you get from a normal search) always lies between 0 and +infinity.
The simple snippet below should give a better idea.
Query query = new Query(...); //some query, for example "name:foo"
int maxdocs = 500;
TopDocs topBusiness = searchEngine.search(query, maxdocs);
ScoreDoc[] hits = topBusiness.scoreDocs;
float score = hits[0].score;
I want to be sure that the variable score cannot be something below 0 (e.g. score=0.00003 would be ok, but score=-1 would not).
Does anybody know?
It is possible to have a score less than 0 in a search result! All you really need to do is set a negative boost to see it in action (example demonstrating negative score).
If you can comfortably assume that you will never have to deal with a negative boost (which, I would say, is usually pretty safe), then you should be safe in assuming all scores will, likewise, be non-negative.
To explain, score is:
score(q,d) = coord(q,d) · queryNorm(q) · ∑ ( tf(t in d) · idf(t)^2 · t.getBoost() · norm(t,d) )
To have a negative score, at least one thing in there has to be negative.
tf - Can't be negative. A term can't appear in a document -1 times, so term frequency must be 0 or higher.
idf^2 - This is a little trickier. idf is:
1 + log ( numDocs / (docFreq + 1))
numDocs and docFreq are both positive, since otherwise we're making silly claims like our index has negative 1 terms. So, the logarithm is of a positive number, which is good, because who wants to deal with imaginary numbers?
We find the idf can be negative or positive, but will be real (not imaginary). Since it's squared in the score calculation, we again have a guaranteed non-negative number. Wolfram Alpha might illustrate this better.
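As a quick sanity check of the idf argument, here is a minimal standalone sketch of the formula above. It does not call Lucene; the class name `IdfRangeSketch` and the sample values are made up for illustration. One extra observation, under the assumption that a real index always has docFreq ≤ numDocs and numDocs ≥ 1: the ratio numDocs / (docFreq + 1) is then at least 1/2, so idf itself stays above roughly 0.31 even before squaring.

```java
final class IdfRangeSketch {
    // idf as written in the formula above: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        long numDocs = 1000; // hypothetical index size
        for (long docFreq : new long[] {0, 1, 10, 999, 1000}) {
            double idf = idf(docFreq, numDocs);
            // idf * idf is what actually enters the score, and a square is never negative.
            System.out.printf("docFreq=%d idf=%.4f idf^2=%.4f%n", docFreq, idf, idf * idf);
        }
    }
}
```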
coord - coord measures the overlap of available query terms vs. matched query terms. Since the least number of terms matched is 0, and you won't query with fewer than zero terms, this can't be negative.
queryNorm - This normalization factor is probably the hardest to get, and the least interesting in practice:
1 / ( q.getBoost()^2 · ∑ ( idf(t) · t.getBoost() )^2 )^(1/2)
For our purposes, we can see right off that everything is getting squared, so unless it's imaginary, the result is positive. Boosts won't be imaginary, and we established above that idf won't be imaginary, so again, we're looking at a positive number here.
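The squaring argument can be checked with a small standalone sketch of the queryNorm formula above. This is plain arithmetic, not a call into Lucene; `QueryNormSketch` and its parameters are hypothetical names. Note that even negative boosts leave the result positive, because every boost is squared before the square root is taken.

```java
final class QueryNormSketch {
    // 1 / sqrt( queryBoost^2 * sum over terms of (idf_t * termBoost_t)^2 )
    static double queryNorm(double queryBoost, double[] idfs, double[] termBoosts) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < idfs.length; i++) {
            double weight = idfs[i] * termBoosts[i];
            sumOfSquares += weight * weight; // squaring removes any sign
        }
        return 1.0 / Math.sqrt(queryBoost * queryBoost * sumOfSquares);
    }

    public static void main(String[] args) {
        double[] idfs = {2.0, 1.5}; // made-up idf values
        System.out.println(queryNorm(1.0, idfs, new double[] {1.0, 1.0}));
        System.out.println(queryNorm(-1.0, idfs, new double[] {-3.0, 1.0}));
    }
}
```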
t.getBoost() - This is where negative scores come into the picture, but we're assuming this is positive for our purposes now.
norm - Norms encode a length normalization and index-time field boosts. We're assuming positive boosts, so the norm will also be positive. That's kind of academic, though. A bit of testing shows that the default norm encoding (which is only one byte in length) doesn't really support negatives. Any negative number encoded and decoded will come out 0.0.
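To illustrate why negatives collapse to 0.0, here is a standalone sketch of a one-byte float encoding with a 3-bit mantissa and a zero-exponent of 15. As far as I recall, this is the scheme behind Lucene's `SmallFloat.floatToByte315` / `byte315ToFloat`, but the code below is a re-creation written for this answer, not copied from the library, so treat the details as an assumption. The class name `NormByteSketch` is made up.

```java
final class NormByteSketch {
    // Encode a float into one byte: 3 mantissa bits, 5 exponent bits, no sign bit.
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ((63 - 15) << 3)) {
            // Zero, negatives (sign bit makes bits <= 0) and underflow land here.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ((63 - 15) << 3) + 0x100) {
            return -1; // overflow: clamp to the largest representable value
        }
        return (byte) (small - ((63 - 15) << 3));
    }

    // Decode the byte back into a float; byte 0 is reserved for 0.0.
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (float f : new float[] {1.0f, 0.5f, -1.0f, -2.5f}) {
            System.out.println(f + " -> " + decode(encode(f)));
        }
    }
}
```

Because the byte has no room for a sign, any negative input takes the first branch and is stored as 0, which matches the round-trip behaviour described above.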
So, only negative boosts will render negative scores. Nothing else in the scoring algorithm will.
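Putting the pieces together, the following is a rough standalone sketch of the score formula with all factors passed in as plain numbers. It is not Lucene's implementation; `NegativeBoostSketch` and the sample values are invented for illustration. It shows that, with non-negative tf, coord, queryNorm, and norm, the sign of the result follows the sign of the boost.

```java
final class NegativeBoostSketch {
    // coord * queryNorm * sum over terms of ( tf * idf^2 * boost * norm )
    static double score(double coord, double queryNorm,
                        double[] tf, double[] idf, double[] boost, double[] norm) {
        double sum = 0.0;
        for (int i = 0; i < tf.length; i++) {
            sum += tf[i] * idf[i] * idf[i] * boost[i] * norm[i];
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        double[] tf = {1.0};
        double[] idf = {2.0};
        double[] norm = {1.0};
        // Same single-term query, once with a positive and once with a negative boost.
        System.out.println(score(1.0, 0.5, tf, idf, new double[] {1.0}, norm));
        System.out.println(score(1.0, 0.5, tf, idf, new double[] {-1.0}, norm));
    }
}
```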