Lucene search scoring issue

I have two indexes created from the directories "test1" and "test2". The "test1" directory contains only "file1.java", whereas "test2" contains two files, "file1.java" and "file2.java". "file1.java" is identical in both directories. Call the indexes index1 and index2 respectively.

When I inspect these two indexes using Luke, I find that the score for a keyword searched in index1 is different from the score for the same keyword in index2. This keyword exists only in "file1.java".

Why are the scores different? Is there any way of indexing in Lucene by which I can force the scores to be the same?

Solution

Scores in Lucene allow you to compare the relevance of the results of a single query. They are not designed to be compared between different indexes or between different queries, or to be saved and compared against later runs. They are only valid with regard to the set of query results they are returned with and the current state of the index. See this article about Lucene Scores as Percentages for more on why it's a bad idea to use Lucene scores in this way.

Lucene scores documents using a TF-IDF algorithm, so you should expect the IDF component to differ in an index with more content: the keyword appears in the only document of index1, but in just one of the two documents of index2. The TFIDFSimilarity documentation describes the scoring algorithm in some detail.
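To make the IDF difference concrete, here is a rough plain-Java sketch. It does not call Lucene. It assumes the classic formula idf = 1 + ln((docCount + 1) / (docFreq + 1)), which is what recent versions document for ClassicSimilarity; the exact formula varies between Lucene versions. The names IdfSketch and classicIdf are made up for illustration.

```java
// Plain-Java sketch with no Lucene dependency.
// IdfSketch and classicIdf are illustrative names, not Lucene API.
final class IdfSketch {

    // Assumed classic TF-IDF inverse document frequency:
    // 1 + ln((docCount + 1) / (docFreq + 1)).
    static double classicIdf(long docFreq, long docCount) {
        return 1.0 + Math.log((docCount + 1) / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // index1: one document in total, the keyword occurs in one document.
        double idfIndex1 = classicIdf(1, 1);
        // index2: two documents in total, the keyword still occurs in only one.
        double idfIndex2 = classicIdf(1, 2);

        System.out.println("index1 idf = " + idfIndex1);
        System.out.println("index2 idf = " + idfIndex2);
    }
}
```

Under that assumption the keyword gets an IDF of 1.0 in index1 and roughly 1.405 in index2, so identical content in "file1.java" still produces different scores.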

If you need different scoring behavior, you can use any of the other Similarity implementations available, or write an implementation yourself.