I have a problem with Lucene's scoring function that I can't figure out. So far, I've been able to write this code to reproduce it.
package lucenebug;

import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class Test {
    private static final String TMP_LUCENEBUG_INDEX = "/tmp/lucenebug_index";

    public static void main(String[] args) throws Throwable {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        IndexWriter w = new IndexWriter(TMP_LUCENEBUG_INDEX, analyzer, true);
        List<String> names = Arrays
                .asList(new String[] { "the rolling stones",
                        "rolling stones (karaoke)",
                        "the rolling stones tribute",
                        "rolling stones tribute band",
                        "karaoke - the rolling stones" });
        try {
            for (String name : names) {
                System.out.println("#name: " + name);
                Document doc = new Document();
                doc.add(new Field("name", name, Field.Store.YES,
                        Field.Index.TOKENIZED));
                w.addDocument(doc);
            }
            System.out.println("finished adding docs, total size: "
                    + w.docCount());
        } finally {
            w.close();
        }
        IndexSearcher s = new IndexSearcher(TMP_LUCENEBUG_INDEX);
        QueryParser p = new QueryParser("name", analyzer);
        Query q = p.parse("name:(rolling stones)");
        System.out.println("--------\nquery: " + q);
        TopDocs topdocs = s.search(q, null, 10);
        for (ScoreDoc sd : topdocs.scoreDocs) {
            System.out.println("" + sd.score + "\t"
                    + s.doc(sd.doc).getField("name").stringValue());
        }
    }
}
The output I get from running it is:
finished adding docs, total size: 5
--------
query: name:rolling name:stones
0.578186 the rolling stones
0.578186 rolling stones (karaoke)
0.578186 the rolling stones tribute
0.578186 rolling stones tribute band
0.578186 karaoke - the rolling stones
I just can't understand why "the rolling stones" has the same relevance as "the rolling stones tribute". According to Lucene's documentation, the more tokens a field has, the smaller the normalization factor should be, and therefore "the rolling stones tribute" should have a lower score than "the rolling stones".
Any ideas?
The length normalization factor is calculated as 1 / sqrt(numTerms) (you can see this in DefaultSimilarity).
This result is not stored in the index directly. It is first multiplied by the boost value specified for the field, and the product is then encoded in 8 bits, as explained in Similarity.encodeNorm(). This is a lossy encoding, which means fine details get lost. With only three or four tokens per field, as in your example, the factors are close enough that they can end up as the same encoded value.
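To make the loss of precision concrete, here is a minimal sketch in plain Java that mimics the 8-bit encoding. The bit layout (5 exponent bits, 3 mantissa bits, zero exponent of 15) is my understanding of what Similarity.encodeNorm() does, so treat it as an assumption and check it against your Lucene version. The class and method names are made up for illustration, and the boost is assumed to be 1.0.

```java
// Sketch only: mimics an 8-bit norm encoding (assumed layout, not Lucene's source).
class NormEncodingDemo {

    // Squeeze a positive float into one byte by dropping the low 21 bits.
    static byte encodeNorm(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> 21;
        int zero = (63 - 15) << 3;
        if (small <= zero) {
            // Zero, negative, or too small to represent.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= zero + 0x100) {
            // Too large to represent: saturate.
            return (byte) -1;
        }
        return (byte) (small - zero);
    }

    // Expand the byte back into a float; the dropped bits come back as zeros.
    static float decodeNorm(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << 21;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Default length normalization as described above, with boost 1.0.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // The value that survives the round trip through one byte.
    static float storedNorm(int numTerms) {
        return decodeNorm(encodeNorm(lengthNorm(numTerms)));
    }

    public static void main(String[] args) {
        for (int n : new int[] { 3, 4, 15 }) {
            System.out.println(n + " tokens: raw=" + lengthNorm(n)
                    + " stored=" + storedNorm(n));
        }
    }
}
```

Under these assumptions, three tokens (raw factor about 0.577) and four tokens (raw factor 0.5) both come back as 0.5, while fifteen tokens come back as 0.25. That would explain why all five documents in the question get the same score.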
If you want to see length normalization in action, try adding a document with the following sentence:
the rolling stones tribute a b c d e f g h i j k
This creates a large enough difference in the length normalization values for you to see it in the scores.
If your fields have very few tokens, as in the examples you used, you could set boost values for the documents or fields based on your own formula, which essentially means a higher boost for shorter fields. Alternatively, you could create a custom Similarity and override the lengthNorm() method.
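As a rough illustration of the custom Similarity route, the sketch below tries a steeper formula, 1 / numTerms, and checks whether three and four tokens still collapse after 8-bit truncation. The formula, the class name, and the quantize() helper are hypothetical and chosen only for this example. The helper assumes the encoding keeps the top two mantissa bits of the float and ignores the underflow and overflow cases. In Lucene 2.x the formula would go into a DefaultSimilarity subclass that overrides lengthNorm(String fieldName, int numTerms), set on the IndexWriter before indexing because norms are computed at index time. Please verify that signature against the version you use.

```java
// Sketch only: a hypothetical steeper length norm, checked against 8-bit truncation.
class SteeperLengthNormDemo {

    // Hypothetical replacement for 1 / sqrt(numTerms); penalizes long fields harder.
    static float lengthNorm(int numTerms) {
        return 1.0f / numTerms;
    }

    // Assumed effect of the byte encoding for mid-range values:
    // the low 21 bits of the float are dropped.
    static float quantize(float f) {
        int bits = Float.floatToRawIntBits(f);
        return Float.intBitsToFloat(bits & ~((1 << 21) - 1));
    }

    public static void main(String[] args) {
        for (int n : new int[] { 3, 4 }) {
            System.out.println(n + " tokens: raw=" + lengthNorm(n)
                    + " stored=" + quantize(lengthNorm(n)));
        }
    }
}
```

With this formula, three tokens are stored as 0.3125 and four tokens as 0.25, so the shorter field keeps an advantage. The trade-off is that long fields are penalized much more heavily, and the index has to be rebuilt for the new norms to take effect.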