
When indexing, what factors can affect a term's score at search time?


The question is a little confusing. I am new to Lucene and going through the documentation. I found out that adding a boost to a field increases the norm of the field and thus increases the score of a term when it's searched.

That is, adding a boost to a field at indexing time can affect the score at search time. My question is: are there any ways other than boosting to do the same? Please advise.


Solution

  • Before Lucene 4.x, there was a single scoring formula, based on the Vector Space Model.

    The following are the factors that contribute to the Lucene scoring.

    1) Tf : term frequency, i.e. frequency of a term in a document.

    2) Idf : inverse document frequency, i.e. log(collection size / number of documents that contain the term). The exact formula may vary.

    3) Field boost : the one you've mentioned. It is set at indexing time.

    4) Coord : a score factor based on how many of the query terms are found in the specified document.

    5) queryNorm(q) : a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.

    6) norm(t,d) encapsulates a few (indexing time) boost and length factors:

    a) Document boost - set by calling doc.setBoost() before adding the document to the index.

    b) Field boost - set by calling field.setBoost() before adding the field to a document.

    c) lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.

    7) Term boost : a search-time boost of term t in the query q.
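
    To see how these factors combine, here is a minimal plain-Java sketch with no Lucene dependency. It assumes the formulas of the classic DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(number of tokens)) and leaves out coord and queryNorm. The class and method names are made up for illustration and are not part of the Lucene API.

    ```java
    // Illustrative sketch only: ClassicScoreSketch and its methods are hypothetical,
    // not Lucene classes. Formulas assume the classic DefaultSimilarity.
    final class ClassicScoreSketch {

        // tf: square root of the term's frequency in the document.
        static double tf(int freq) {
            return Math.sqrt(freq);
        }

        // idf: 1 + ln(numDocs / (docFreq + 1)); rarer terms get a higher value.
        static double idf(long docFreq, long numDocs) {
            return 1.0 + Math.log((double) numDocs / (docFreq + 1));
        }

        // norm(t,d): document boost * field boost * lengthNorm,
        // where lengthNorm = 1 / sqrt(number of tokens in the field).
        static double norm(double docBoost, double fieldBoost, int numTokens) {
            return docBoost * fieldBoost * (1.0 / Math.sqrt(numTokens));
        }

        // Contribution of a single query term: tf * idf^2 * termBoost * norm.
        // coord and queryNorm are omitted in this sketch.
        static double termScore(int freq, long docFreq, long numDocs,
                                double termBoost, double norm) {
            double idf = idf(docFreq, numDocs);
            return tf(freq) * idf * idf * termBoost * norm;
        }

        public static void main(String[] args) {
            // Same term statistics, with and without an index-time field boost of 2.
            double plain = termScore(4, 9, 1000, 1.0, norm(1.0, 1.0, 100));
            double boosted = termScore(4, 9, 1000, 1.0, norm(1.0, 2.0, 100));
            System.out.println("plain   = " + plain);
            System.out.println("boosted = " + boosted);
        }
    }
    ```

    In this sketch, an index-time field boost of 2 doubles the term's contribution, and so does a search-time term boost of 2. A shorter field (fewer tokens) also raises the score through lengthNorm, which is one answer to the "other than boosting" part of the question. Real Lucene stores the norm in a compressed form, so actual scores will differ slightly from this arithmetic.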

    For an in-depth description of Lucene's default scoring formula, check the documentation: Lucene Similarity

    Lucene 4.x introduced additional scoring models, such as BM25. For more details, check the subclasses of Similarity in the Lucene 4.2 documentation.
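
    For comparison, here is a rough plain-Java sketch of a BM25 term score. It assumes the commonly used form of the formula with the default parameters k1 = 1.2 and b = 0.75, and it ignores boosts and the lossy norm encoding that Lucene applies. The class and method names are hypothetical and not part of Lucene.

    ```java
    // Illustrative sketch only: Bm25Sketch is a hypothetical class, not Lucene's BM25Similarity.
    final class Bm25Sketch {

        static final double K1 = 1.2;   // controls term-frequency saturation
        static final double B = 0.75;   // controls how strongly field length is normalized

        // idf: ln(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5)).
        static double idf(long docFreq, long numDocs) {
            return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
        }

        // Score of one term: idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len / avgLen)).
        static double termScore(int freq, long docFreq, long numDocs,
                                double fieldLength, double avgFieldLength) {
            double lengthFactor = 1.0 - B + B * (fieldLength / avgFieldLength);
            return idf(docFreq, numDocs) * (freq * (K1 + 1.0)) / (freq + K1 * lengthFactor);
        }

        public static void main(String[] args) {
            // Same term and frequency in an average-length field and in a shorter one.
            System.out.println("average field = " + termScore(3, 10, 1000, 100, 100));
            System.out.println("short field   = " + termScore(3, 10, 1000, 50, 100));
        }
    }
    ```

    Unlike the square-root tf of the classic formula, the term-frequency part here saturates: however often a term repeats, it can never contribute more than (k1 + 1) times its idf. Field length still matters, so what you index continues to influence the score even without explicit boosts.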

    You can implement a subclass of Similarity to customize all the above scoring factors. Here is an Example...