
lucene/solr norms: preventing very short fields from ranking inappropriately high


Using norms at indexing time works well for me in general; my problem is that very short fields rank inappropriately high. Example:

doc1 : tf(200) out of 1,000
doc2 : tf(150) out of 500

doc2 will score higher, which is great.

The problem is when I have:

doc3 : tf(3) out of 4

which is not great in my case, because such a document is very rare; call it an exception.

I've read a suggestion (from KinoSearch, I believe) to introduce a constant offset to mitigate this issue. Any ideas on how I can still leverage the full power of norms while avoiding this problem?

Thanks


Solution

  • You can create your own Similarity class, extending DefaultSimilarity, and override the lengthNorm method. The default implementation is quite simple:

    public float lengthNorm(FieldInvertState state) {
        final int numTerms;
        // Optionally ignore tokens at the same position (e.g. injected synonyms)
        if (discountOverlaps)
            numTerms = state.getLength() - state.getNumOverlap();
        else
            numTerms = state.getLength();
        // Shorter fields get a larger norm: boost * 1/sqrt(field length)
        return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
    }
    

    Replace it with whatever algorithm makes sense in your case. The last line is probably all you need to modify, specifically the 1.0 / Math.sqrt(numTerms) term. Two things to keep in mind:

    • Norms are compressed in a very lossy fashion (about one significant decimal digit!) to conserve space. Big differences matter; minor tweaks will tend to get lost.
    • You will need to re-index. Norms are stored at index time, rather than calculated at query time.
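    The constant-offset idea from the question can be sketched as plain math, before wiring it into a Similarity. In the code below, `NormSketch`, `PIVOT`, and the helper method names are all illustrative (not Lucene API); in a real DefaultSimilarity subclass you would return `state.getBoost() * offsetNorm(numTerms)` from lengthNorm. Adding a constant under the square root damps the advantage of very short fields while leaving long fields almost untouched.

    ```java
    // Sketch: compare Lucene's default length norm with an offset variant.
    public class NormSketch {
        // Hypothetical offset; tune for your corpus and typical field lengths.
        static final double PIVOT = 10.0;

        // Default Lucene behavior: 1/sqrt(numTerms)
        static double defaultNorm(int numTerms) {
            return 1.0 / Math.sqrt(numTerms);
        }

        // Offset variant: 1/sqrt(numTerms + PIVOT)
        static double offsetNorm(int numTerms) {
            return 1.0 / Math.sqrt(numTerms + PIVOT);
        }

        public static void main(String[] args) {
            // Field lengths from the question: doc3, doc2, doc1
            int[] lengths = {4, 500, 1000};
            for (int n : lengths) {
                System.out.printf("len=%4d  default=%.4f  offset=%.4f%n",
                        n, defaultNorm(n), offsetNorm(n));
            }
        }
    }
    ```

    With PIVOT = 10, a 4-term field's norm drops by roughly half, while a 1000-term field's norm changes by well under one percent, which is exactly the asymmetry the question is after. Remember the lossy-compression caveat above: a tiny PIVOT may be quantized away entirely.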

    You can point Solr at your Similarity in schema.xml, like so:

    <similarity class="this.is.my.CustomSimilarity"/>
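
    If you only want the custom behavior on particular fields, Solr 4.0 and later also let you declare a similarity per field type rather than globally (the fieldType name and analyzer chain below are illustrative; depending on your version you may also need solr.SchemaSimilarityFactory as the global similarity for per-field settings to take effect):

    <fieldType name="text_shortsafe" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <similarity class="this.is.my.CustomSimilarity"/>
    </fieldType>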