I am migrating from ES 1.7 to ES 6.5. The data source is the same, but when I search for a particular keyword the two versions return different scores. Since only the hit with the max score is selected, I end up with a different result.
I used '_explain' in Elasticsearch to understand the details of the score calculation. Below are the query and the explanation for the same keyword in both indices.
Query used :
{
"explain": true,
"query": {
"function_score": {
"query": {
"match": {
"search": {
"query": "san"
}
}
},
"functions": [
{
"field_value_factor": {
"field": "related.score"
}
}
]
}
},
"from": 0,
"size": 1
}
Mappings for ES 1.7
{
"_id": {
"path": "search"
},
"properties": {
"related": {
"properties": {
"category": {
"type": "long"
},
"score": {
"type": "double"
},
"search": {
"type": "string"
}
}
},
"search": {
"type": "string",
"analyzer": "english"
}
}
}
Explanation for Query in ES 1.7 :
{
"_explanation": {
"value": 4.83643,
"description": "function score, product of:",
"details": [
{
"value": 4.8384395,
"description": "weight(search:san in 11405) [PerFieldSimilarity], result of:",
"details": [
{
"value": 4.8384395,
"description": "fieldWeight in 11405, product of:",
"details": [
{
"value": 1,
"description": "tf(freq=1.0), with freq of:",
"details": [
{
"value": 1,
"description": "termFreq=1.0"
}
]
},
{
"value": 4.8384395,
"description": "idf(docFreq=1072, maxDocs=49844)"
},
{
"value": 1,
"description": "fieldNorm(doc=11405)"
}
]
}
]
},
{
"value": 0.99958473,
"description": "Math.min of",
"details": [
{
"value": 0.99958473,
"description": "field value function: (doc['related.score'].value * factor=1.0)"
},
{
"value": 3.4028235e+38,
"description": "maxBoost"
}
]
},
{
"value": 1,
"description": "queryBoost"
}
]
}
}
Mappings for ES 6.5
{
"mappings": {
"searches": {
"properties": {
"related": {
"properties": {
"category": {
"type": "long"
},
"score": {
"type": "double"
},
"search": {
"type": "text"
}
}
},
"search": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
Explanation for Query in ES 6.5 :
{
"_explanation": {
"value": 5.1439505,
"description": "function score, product of:",
"details": [
{
"value": 5.1460876,
"description": "weight(search:san in 2464) [PerFieldSimilarity], result of:",
"details": [
{
"value": 5.1460876,
"description": "score(doc=2464,freq=1.0 = termFreq=1.0\n), product of:",
"details": [
{
"value": 3.82669,
"description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details": [
{
"value": 5419,
"description": "docFreq",
"details": []
},
{
"value": 248810,
"description": "docCount",
"details": []
}
]
},
{
"value": 1.3447882,
"description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details": [
{
"value": 1,
"description": "termFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "parameter k1",
"details": []
},
{
"value": 0.75,
"description": "parameter b",
"details": []
},
{
"value": 2.679008,
"description": "avgFieldLength",
"details": []
},
{
"value": 1,
"description": "fieldLength",
"details": []
}
]
}
]
}
]
},
{
"value": 0.99958473,
"description": "min of:",
"details": [
{
"value": 0.99958473,
"description": "field value function: none(doc['related.score'].value * factor=1.0)",
"details": []
},
{
"value": 3.4028235e+38,
"description": "maxBoost",
"details": []
}
]
}
]
}
}
Comparing the two explanations, the score calculation differs between the two versions of ES, which leads to different scores. The query uses size=1, so it should return the record with the max score; but because the scoring method changed, the same keyword gets a different score in ES 1.7 and ES 6.5, and a different record ends up with the max score.
Can someone please help me find a way to get the same scores?
There are several changes between these two versions. The main one is how the score is calculated: ES 1.7 uses TF/IDF, while ES 6.x uses BM25 by default.
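To make the difference concrete, here is a minimal Python sketch that recomputes both scores from the numbers in the explain outputs in the question. The BM25 formulas are copied from the ES 6.5 explain text. The TF/IDF formula (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq)) is the classic Lucene one and is an assumption here, since the ES 1.7 explain output does not print it. The function names are made up for illustration.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Assumed classic Lucene TF/IDF idf, as used by ES 1.x.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def classic_tf(freq):
    # Assumed classic Lucene tf: square root of the term frequency.
    return math.sqrt(freq)

def bm25_idf(doc_freq, doc_count):
    # Formula as printed in the ES 6.5 explain output.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # Formula as printed in the ES 6.5 explain output.
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

# Value of related.score for the top hit, taken from both explain outputs.
field_value_factor = 0.99958473

# ES 1.7: tf * idf * fieldNorm, with fieldNorm = 1 in the explain output.
es17_weight = classic_tf(1.0) * classic_idf(1072, 49844) * 1.0
es17_score = es17_weight * field_value_factor

# ES 6.5: idf * tfNorm, with k1, b and field lengths from the explain output.
es65_weight = bm25_idf(5419, 248810) * bm25_tf_norm(1.0, 1.2, 0.75, 1.0, 2.679008)
es65_score = es65_weight * field_value_factor

print("ES 1.7:", es17_weight, es17_score)  # compare with 4.8384395 and 4.83643
print("ES 6.5:", es65_weight, es65_score)  # compare with 5.1460876 and 5.1439505
```

Both the formula and the term statistics (docFreq and the document count) differ between the two outputs, so the final scores cannot be expected to match.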
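It also depends on how many shards your index has, because the score is calculated locally on each shard. Note that your two explain outputs already use different term statistics (docFreq=1072 with maxDocs=49844 in ES 1.7 versus docFreq=5419 with docCount=248810 in ES 6.5).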
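As a rough illustration of the shard effect, the sketch below computes the BM25 idf of one term from per-shard statistics and from the combined statistics. The shard numbers are hypothetical and chosen only to show the spread. Running the search with search_type=dfs_query_then_fetch makes Elasticsearch use index-wide term statistics instead of shard-local ones, which removes this particular source of variation at some extra cost per query.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # Same idf formula as printed in the ES 6.5 explain output.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical (docFreq, docCount) pairs for the same term on two shards.
shard_stats = [(100, 10_000), (900, 10_000)]

# idf as each shard would compute it from its own local statistics.
per_shard_idf = [bm25_idf(df, n) for df, n in shard_stats]

# idf computed from the statistics summed over all shards.
total_doc_freq = sum(df for df, _ in shard_stats)
total_doc_count = sum(n for _, n in shard_stats)
global_idf = bm25_idf(total_doc_freq, total_doc_count)

print("per-shard idf:", per_shard_idf)
print("global idf:", global_idf)
```

The same term gets a noticeably different idf depending on which shard holds the document, so two indices with different shard layouts can score the same document differently even with the same similarity.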
IMO, getting exactly the same score in both versions might be really difficult, even with the same number of primary shards, and especially for a huge number of documents. What you should aim for instead is checking that the order has not changed significantly for the same search queries (i.e. if a document was in the top 5 earlier, it should still be in the top 5 or 10 or so).
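One simple way to check the ordering is to run the same query with a larger size against both clusters and measure how much the top results overlap. The helper below is a minimal sketch: top_k_overlap and the sample IDs are hypothetical, and fetching the ranked IDs from each cluster is left out.

```python
def top_k_overlap(old_ids, new_ids, k):
    """Fraction of the top-k results that appear in both ranked lists."""
    if k <= 0:
        raise ValueError("k must be positive")
    old_top = set(old_ids[:k])
    new_top = set(new_ids[:k])
    return len(old_top & new_top) / k

# Hypothetical ranked document IDs for the same query on each cluster.
es17_ids = ["san francisco", "san diego", "san jose", "santa fe", "san antonio"]
es65_ids = ["san diego", "san francisco", "san antonio", "san jose", "san marino"]

for k in (1, 3, 5):
    print(k, top_k_overlap(es17_ids, es65_ids, k))
```

A low overlap at k=1 combined with a high overlap at k=5 would suggest the ranking is broadly stable and only near-ties at the top have swapped places.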