I'm currently trying to set up a combined query made of several fuzzy match queries. I noticed something I'd like explained before I move on to combining queries.
I have documents such as the following, indexed in a single index named text:
{
"article": "someArticleName",
"articleInfo": "someInfo", // potentially missing if this matters
"userId": 2
}
If I run the following query:
{
"from":0,
"min_score":0.6,
"query":{
"bool":{
"filter":[
{"term":{"userId":{"value": 2}}}
],
"should": {"match":{"article":{"fuzziness":"AUTO","query":"1705aa"}}}
}
},
"size":20,
"sort":[{"_score":{"order":"desc"}}],
"explain": True
}
then I receive this result:
{'took': 16,
'timed_out': False,
'_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 6, 'relation': 'eq'},
'max_score': 3.5664783,
'hits': [{'_index': 'text',
'_type': '_doc',
'_id': 'id-1',
'_score': 3.5664783,
'_source': {'id': 'id-1',
'article': '1705aa',
'indexName': 'text',
'currentVersion': 0,
'userId': 2,
'indexedUtc': '2022-05-23T07:47:48.6175402+00:00'}},
{'_index': 'text',
'_type': '_doc',
'_id': 'id-2',
'_score': 1.3915253,
'_source': {'id': 'id-2',
'article': '1705aa',
'articleInfo': 'someInfo',
'userId': 2,
'indexedUtc': '2022-05-23T09:57:11.8080429+00:00'}},
...
}
and this as an explanation:
{'description': 'sum of:',
'details': [{'description': 'weight(article:1705aa in 220) '
'[PerFieldSimilarity], result of:',
'details': [{'description': 'score(freq=1.0), computed as boost '
'* idf * tf from:',
'details': [{'description': 'boost',
'details': [],
'value': 2.2},
{'description': 'idf, computed as log(1 '
'+ (N - n + 0.5) / (n + '
'0.5)) from:',
'details': [{'description': 'n, number '
'of '
'documents '
'containing '
'term',
'details': [],
'value': 1},
{'description': 'N, total '
'number of '
'documents '
'with '
'field',
'details': [],
'value': 37}],
'value': 3.232121},
{'description': 'tf, computed as freq / '
'(freq + k1 * (1 - b + '
'b * dl / avgdl)) from:',
'details': [{'description': 'freq, '
'occurrences '
'of term '
'within '
'document',
'details': [],
'value': 1.0},
{'description': 'k1, term '
'saturation '
'parameter',
'details': [],
'value': 1.2},
{'description': 'b, length '
'normalization '
'parameter',
'details': [],
'value': 0.75},
{'description': 'dl, '
'length of '
'field',
'details': [],
'value': 1.0},
{'description': 'avgdl, '
'average '
'length of '
'field',
'details': [],
'value': 1.2972972}],
'value': 0.50156736}],
'value': 3.5664783}],
'value': 3.5664783},
{'description': 'match on required clause, product of:',
'details': [{'description': '# clause',
'details': [],
'value': 0.0},
{'description': 'userId:[2 TO 2]',
'details': [],
'value': 1.0}],
'value': 0.0}],
'value': 3.5664783}
{'currentVersion': 0,
'id': 'id-1',
'indexName': 'text',
'indexedUtc': '2022-05-23T07:47:48.6175402+00:00',
'article': '1705aa',
'userId': 2}
{'description': 'sum of:',
'details': [{'description': 'weight(article:1705aa in 0) '
'[PerFieldSimilarity], result of:',
'details': [{'description': 'score(freq=1.0), computed as boost '
'* idf * tf from:',
'details': [{'description': 'boost',
'details': [],
'value': 2.2},
{'description': 'idf, computed as log(1 '
'+ (N - n + 0.5) / (n + '
'0.5)) from:',
'details': [{'description': 'n, number '
'of '
'documents '
'containing '
'term',
'details': [],
'value': 5},
{'description': 'N, total '
'number of '
'documents '
'with '
'field',
'details': [],
'value': 20}],
'value': 1.3397744},
{'description': 'tf, computed as freq / '
'(freq + k1 * (1 - b + '
'b * dl / avgdl)) from:',
'details': [{'description': 'freq, '
'occurrences '
'of term '
'within '
'document',
'details': [],
'value': 1.0},
{'description': 'k1, term '
'saturation '
'parameter',
'details': [],
'value': 1.2},
{'description': 'b, length '
'normalization '
'parameter',
'details': [],
'value': 0.75},
{'description': 'dl, '
'length of '
'field',
'details': [],
'value': 1.0},
{'description': 'avgdl, '
'average '
'length of '
'field',
'details': [],
'value': 1.1}],
'value': 0.472103}],
'value': 1.3915253}],
'value': 1.3915253},
{'description': 'match on required clause, product of:',
'details': [{'description': '# clause',
'details': [],
'value': 0.0},
{'description': 'userId:[2 TO 2]',
'details': [],
'value': 1.0}],
'value': 0.0}],
'value': 1.3915253}
{'currentVersion': 0,
'id': 'id-2',
'indexName': 'text',
'indexedUtc': '2022-05-23T09:57:11.8080429+00:00',
'article': ' 1705aa ',
'articleInfo': 'someInfo',
'userId': 2}
Why do I get a score of 3.5664783 for the first document and 1.3915253 for the second?
They're located in the same index and both are an exact match for the fuzzy query. The number of documents used in _explanation seems to differ between the two, and I don't understand why, or how to get equal scores for both documents.
The response shows that your index has 5 shards, and sharding affects relevance scoring. Your documents are distributed among the shards and, by default, Elasticsearch makes each shard responsible for producing its own scores from its own local statistics. That is why n (documents containing the term) and N (documents with the field) differ between the two explanations: the two documents live on different shards.
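To see that the gap comes only from the per-shard statistics, the two scores can be recomputed by hand from the formulas printed in the explain output. This is a minimal sketch: the function name `bm25_term_score` is made up for illustration, and every number is copied from the two explanations in the question.

```python
import math

def bm25_term_score(boost, n, N, freq, dl, avgdl, k1=1.2, b=0.75):
    """Score of one term as laid out in the explain output: boost * idf * tf."""
    # idf, computed as log(1 + (N - n + 0.5) / (n + 0.5))
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    # tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

# Statistics copied from the explanation of id-1 (n=1, N=37) and id-2 (n=5, N=20).
score_id_1 = bm25_term_score(boost=2.2, n=1, N=37, freq=1.0, dl=1.0, avgdl=1.2972972)
score_id_2 = bm25_term_score(boost=2.2, n=5, N=20, freq=1.0, dl=1.0, avgdl=1.1)

print(round(score_id_1, 4))  # 3.5665
print(round(score_id_2, 4))  # 1.3915
```

Both documents have the same freq and dl, so the whole difference comes from n, N and avgdl, which are counted per shard.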
Read more about impact of shards on scoring here and here
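By default, Elasticsearch uses a search type called Query Then Fetch: the query is sent to every shard, each shard scores its matching documents using its own local term and document frequencies, and the coordinating node merges the per-shard results before fetching the documents.
To get more consistent scores, you can use DFS Query Then Fetch with your search query, like this: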
GET /test_index/_search?search_type=dfs_query_then_fetch
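If the query is sent from Python, the same search type can be passed as a URL parameter. The sketch below uses only the standard library and only builds the request; the helper name `build_search_request`, the host `http://localhost:9200` and the index name `text` are assumptions for illustration and should be adapted to your setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def build_search_request(base_url, index, query_body,
                         search_type="dfs_query_then_fetch"):
    """Build (but do not send) a _search request with the given search_type."""
    params = urlencode({"search_type": search_type})
    url = f"{base_url.rstrip('/')}/{index}/_search?{params}"
    data = json.dumps(query_body).encode("utf-8")
    return Request(url, data=data, method="POST",
                   headers={"Content-Type": "application/json"})

# The query from the question, unchanged apart from being a Python dict.
query_body = {
    "from": 0,
    "min_score": 0.6,
    "query": {
        "bool": {
            "filter": [{"term": {"userId": {"value": 2}}}],
            "should": {"match": {"article": {"fuzziness": "AUTO",
                                             "query": "1705aa"}}},
        }
    },
    "size": 20,
    "sort": [{"_score": {"order": "desc"}}],
    "explain": True,
}

request = build_search_request("http://localhost:9200", "text", query_body)
print(request.full_url)

# To actually run it against a reachable cluster (assumed, not tested here):
# with urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```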
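With dfs_query_then_fetch, Elasticsearch first asks every shard for its term and document frequencies, combines them, and then scores the matching documents with these global statistics, so identical documents get the same score regardless of which shard holds them. The extra round trip makes the search somewhat slower.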
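The effect of pooling the statistics can be illustrated with the numbers from the question. This is only a sketch under a loud assumption: it pretends the two shards seen in the explanations are the only ones, whereas the real index has 5 shards, so the pooled score printed here is not the score Elasticsearch would return. The `total_len` values are inferred from N and avgdl (37 × 1.2972972 ≈ 48 and 20 × 1.1 = 22), and the helper name `bm25` is made up.

```python
import math

def bm25(stats, boost=2.2, freq=1.0, dl=1.0, k1=1.2, b=0.75):
    """boost * idf * tf, using the term statistics given in `stats`."""
    idf = math.log(1 + (stats["N"] - stats["n"] + 0.5) / (stats["n"] + 0.5))
    avgdl = stats["total_len"] / stats["N"]
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

# Per-shard statistics taken from the two explanations (total_len is inferred).
shard_a = {"n": 1, "N": 37, "total_len": 48}  # shard holding id-1
shard_b = {"n": 5, "N": 20, "total_len": 22}  # shard holding id-2

# Hypothetical pooled statistics, as if these were the only two shards.
pooled = {key: shard_a[key] + shard_b[key] for key in shard_a}

# Query Then Fetch: each document is scored with its own shard's statistics.
local = {"id-1": bm25(shard_a), "id-2": bm25(shard_b)}
# DFS Query Then Fetch: every document is scored with the pooled statistics.
dfs = {"id-1": bm25(pooled), "id-2": bm25(pooled)}

print(local)  # two different scores
print(dfs)    # the same score for both documents
```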