
lucene/solr norms: preventing very short fields from ranking inappropriately high


Using norms at indexing time works well for me in general; my problem is that very short fields rank inappropriately high. Example:

doc1 : tf(200) out of 1,000
doc2 : tf(150) out of 500

doc2 will score higher, which is great.

The problem is when I have:

doc3 : tf(3) out of 4

which is not great in my case, because such a document is very rare; call it an exception.

I've read a suggestion (from KinoSearch, I believe) to introduce a constant offset to mitigate this issue. Any ideas on how I can still leverage the full power of norms while avoiding this problem?

Thanks


Solution

  • You can create your own Similarity class, extending DefaultSimilarity, and override the lengthNorm method. The default implementation is quite simple:

    public float lengthNorm(FieldInvertState state) {
        final int numTerms;
        // Optionally ignore tokens at the same position (e.g. injected synonyms)
        if (discountOverlaps)
            numTerms = state.getLength() - state.getNumOverlap();
        else
            numTerms = state.getLength();
        // Shorter fields get a larger norm: boost * 1/sqrt(field length)
        return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
    }
    

    Replace it with whatever algorithm makes sense in your case. The last line is probably all you need to modify, specifically the 1.0 / Math.sqrt(numTerms) term. Two things to keep in mind:

    • Norms are compressed in a very lossy fashion (about one significant decimal digit!) to conserve space. Big differences matter; minor tweaks will tend to get lost.
    • You will need to re-index. Norms are stored at index time, rather than calculated at query time.
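    The constant-offset idea from the question can be sketched as plain math, before wiring it into a Similarity. In the code below, `NormSketch`, `PIVOT`, and the helper method names are all illustrative (not Lucene API); in a real DefaultSimilarity subclass you would return `state.getBoost() * offsetNorm(numTerms)` from lengthNorm. Adding a constant under the square root damps the advantage of very short fields while leaving long fields almost untouched.

    ```java
    // Sketch: compare Lucene's default length norm with an offset variant.
    public class NormSketch {
        // Hypothetical offset; tune for your corpus and typical field lengths.
        static final double PIVOT = 10.0;

        // Default Lucene behavior: 1/sqrt(numTerms)
        static double defaultNorm(int numTerms) {
            return 1.0 / Math.sqrt(numTerms);
        }

        // Offset variant: 1/sqrt(numTerms + PIVOT)
        static double offsetNorm(int numTerms) {
            return 1.0 / Math.sqrt(numTerms + PIVOT);
        }

        public static void main(String[] args) {
            // Field lengths from the question: doc3, doc2, doc1
            int[] lengths = {4, 500, 1000};
            for (int n : lengths) {
                System.out.printf("len=%4d  default=%.4f  offset=%.4f%n",
                        n, defaultNorm(n), offsetNorm(n));
            }
        }
    }
    ```

    With PIVOT = 10, a 4-term field's norm drops by roughly half, while a 1000-term field's norm changes by well under one percent, which is exactly the asymmetry the question is after. Remember the lossy-compression caveat above: a tiny PIVOT may be quantized away entirely.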

    You can point Solr at your Similarity in schema.xml, like so:

    <similarity class="this.is.my.CustomSimilarity"/>
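
    If you only want the custom behavior on particular fields, Solr 4.0 and later also let you declare a similarity per field type rather than globally (the fieldType name and analyzer chain below are illustrative; depending on your version you may also need solr.SchemaSimilarityFactory as the global similarity for per-field settings to take effect):

    <fieldType name="text_shortsafe" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <similarity class="this.is.my.CustomSimilarity"/>
    </fieldType>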