Tags: elasticsearch, search, lucene, full-text-search, opensearch

Searching documents in Elasticsearch with the most terms in common with my query but also the fewest uncommon terms


I am struggling with an Elasticsearch query over an ngrams field. I am trying to fetch the documents whose value in that field has ngram tokens most similar to those of my query input. To be precise, I want the non-matching tokens to count negatively in the score of a given document. If I query "current assets", I would like documents like "my current assets" to be scored higher than "This is the current period report of current assets". The second document has more terms in common with my query ("current" appears twice), but it also has more uncommon terms ("This", "is", ...). How can I make Elasticsearch take the uncommon terms into consideration when scoring?

I have tried checking the explanation of how the scores are calculated for different documents. Weirdly, it shows that for both documents the score of the same term gets different values because the IDF is different. In other words, it shows that the same term (e.g. "curr") appears in a different number of documents depending on which document I ask the explanation for. How is this even possible? Also, I have only one Elasticsearch node.


Solution

  • To be precise, I want the non-matching tokens to count negatively in the score of a given document.

    This cannot be done directly, because information about non-matched tokens is not readily available at search time. However, you do have access to the overall document length, so you can introduce a penalty for the length of the document. In the default similarity (BM25), this is controlled by the parameter b. See the similarity module docs for more information.
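To make the effect of b concrete, here is a minimal sketch of the length-normalised term-frequency part of BM25, applied to the two example documents from the question. It is an illustration only: it uses whole words instead of ngrams, omits IDF and boost (they are the same for both documents here), and the names `length_penalised` and `title` in the settings are hypothetical placeholders, not anything from the original question.

```python
def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Length-normalised term-frequency factor of BM25 (IDF and boost omitted)."""
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq / (freq + norm)


def score(query_terms, doc_tokens, avg_doc_len, b):
    """Sum the BM25 tf factor over the query terms that occur in the document."""
    return sum(
        bm25_tf(doc_tokens.count(term), len(doc_tokens), avg_doc_len, b=b)
        for term in query_terms
        if term in doc_tokens
    )


# Assumption: word tokens stand in for the ngram tokens of the real field.
short_doc = "my current assets".split()
long_doc = "this is the current period report of current assets".split()
avg_len = (len(short_doc) + len(long_doc)) / 2
query = ["current", "assets"]

for b in (0.0, 0.75, 1.0):
    print(
        f"b={b}: short={score(query, short_doc, avg_len, b):.3f} "
        f"long={score(query, long_doc, avg_len, b):.3f}"
    )

# Hypothetical index body that raises b for one field; it would be sent
# when creating the index. "length_penalised" and "title" are made-up names.
index_body = {
    "settings": {
        "index": {
            "similarity": {
                "length_penalised": {"type": "BM25", "b": 1.0, "k1": 1.2}
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "similarity": "length_penalised"}
        }
    },
}
```

With b set to 0 the longer document wins because "current" occurs twice; with the default b of 0.75 the shorter document wins, and raising b towards 1 widens the gap. This does not count non-matching tokens individually, it only penalises overall length, which is the closest approximation the answer above suggests.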

  • How can I make Elasticsearch take the uncommon terms into consideration when scoring?

    If by "common" you mean matching, then the terms that match between your query and the document are the most important part of the scoring. We would need to look at the explanation to tell you exactly why you are getting unexpected results. Using the word "common" when talking about scoring is a bit confusing, since a "common" term can be misread as a "frequent" term in the IDF sense.

  • In other words, it shows that the same term (e.g. "curr") appears in a different number of documents depending on which document I ask the explanation for. How is this even possible?

    By default, IDF is calculated on each shard independently. With a lot of documents these numbers tend to be similar across shards, but with a small number of documents and rare terms you might see some discrepancy. Note that a single node can still hold an index with several shards, so having only one node does not rule this out. The simplest way to fix that is to use a single shard or to set search_type to dfs_query_then_fetch. See the documentation for more information on this topic.
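The per-shard effect can be sketched with a few lines of Python. This is a toy model, not Elasticsearch code: the shard names and document counts are invented, and the IDF formula is the BM25 form `ln(1 + (N - n + 0.5) / (n + 0.5))` that appears in recent score explanations.

```python
import math


def bm25_idf(doc_count, doc_freq):
    """BM25-style IDF: doc_count documents in total, doc_freq of them contain the term."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Made-up statistics for one term (say "curr") on two shards of the same index.
shards = {
    "shard_0": {"docs": 5, "with_term": 1},
    "shard_1": {"docs": 3, "with_term": 3},
}

# Default behaviour: each shard scores with its own local statistics.
per_shard_idf = {
    name: bm25_idf(stats["docs"], stats["with_term"])
    for name, stats in shards.items()
}

# What a DFS pre-phase achieves: statistics are summed across shards first,
# so every document is scored with the same IDF.
total_docs = sum(stats["docs"] for stats in shards.values())
total_with_term = sum(stats["with_term"] for stats in shards.values())
global_idf = bm25_idf(total_docs, total_with_term)

for name, idf in per_shard_idf.items():
    print(f"{name}: local idf = {idf:.3f}")
print(f"global idf = {global_idf:.3f}")
```

Two documents that live on different shards therefore see different IDF values for the same term unless the statistics are combined. In a real request that means adding the query parameter to the search call, for example `GET /my-index/_search?search_type=dfs_query_then_fetch`, where `my-index` is a placeholder index name.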

    Sorry for the generic answers, but most of the complaints in your question can be attributed to the issues that DFS addresses. If you still have concrete issues after switching to a single shard or using dfs_query_then_fetch, please open a new question with some example documents and queries, a printout of the scoring explanation, and a description of why specifically you are not satisfied with the scores you got.