solr, similarity, solrcloud, tf-idf

IDF similarity across shards does not work as expected, uses only local shard info


I am using Solr (7.3) for my grocery product data. I am getting strange results caused by IDF when the data is spread across multiple shards (3 shards).

My search keyword was "milk".

Milk is not a rare keyword in my collection. But in one of the shards, very few documents (1-2 docs out of 9000) contain the keyword milk. So in that shard (shard1) the IDF score is very high, almost 3 times the score from the other shards, and this is affecting my results. I do not expect that specific document from shard1 to be the top result.

Is there any way to control IDF scoring, as we can do for TF in BM25 with the k1 and b parameters?

Or is there a BM25 similarity without IDF? I could create my own similarity and use it, but our Solr service does not allow customising Solr.

Or is there any other way to solve this?


Solution

  • You can use a different statsCache to get support for distributed IDF. The default option (LocalStatsCache) only uses values from the local shard, but you can change it to one of the distributed options to make Solr use a collection-wide IDF when calculating scores instead.

    Document and term statistics are needed in order to calculate relevancy. Solr provides four implementations out of the box when it comes to document stats calculation:

    LocalStatsCache: This only uses local term and document statistics to compute relevance. In cases with uniform term distribution across shards, this works reasonably well. This option is the default if no <statsCache> is configured.

    ExactStatsCache: This implementation uses global values (across the collection) for document frequency.

    ExactSharedStatsCache: This works exactly like ExactStatsCache, but the global stats are reused for subsequent requests with the same terms.

    LRUStatsCache: This implementation uses an LRU cache to hold global stats, which are shared between requests.

    The implementation can be selected by setting <statsCache> in solrconfig.xml. For example, the following line makes Solr use the ExactStatsCache implementation:

    <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
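To see why shard-local statistics skew the score, here is a minimal Python sketch, not Solr code. It assumes the BM25-style IDF formula ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is the number of documents containing the term. The per-shard counts are hypothetical: only the "2 docs among 9000" figure for shard1 comes from the question, and the numbers for shard2 and shard3 are made up for illustration.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """BM25-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical statistics for the term "milk":
# shard name -> (documents in shard, documents containing the term).
# Only shard1's figures are taken from the question; the rest are assumptions.
SHARDS = {
    "shard1": (9000, 2),
    "shard2": (9000, 900),
    "shard3": (9000, 900),
}


def local_idfs(shards: dict) -> dict:
    """IDF computed independently per shard (what local-only stats amount to)."""
    return {name: bm25_idf(n_docs, df) for name, (n_docs, df) in shards.items()}


def global_idf(shards: dict) -> float:
    """IDF computed from statistics summed over all shards."""
    total_docs = sum(n_docs for n_docs, _ in shards.values())
    total_df = sum(df for _, df in shards.values())
    return bm25_idf(total_docs, total_df)


if __name__ == "__main__":
    for name, idf in local_idfs(SHARDS).items():
        print(f"{name}: local idf = {idf:.3f}")
    print(f"collection-wide idf = {global_idf(SHARDS):.3f}")
```

With these made-up counts, the local IDF on shard1 is more than three times the local IDF on the other shards, while the collection-wide value is the same for every shard, so a match on shard1 no longer gets an inflated score.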