Tags: elasticsearch, explain

Understand elasticsearch query explain


I'm trying to understand the Explain API scoring in the elastic documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

When I couldn't figure it out on my own simple index with just a couple of documents, I tried to reproduce the calculation from the example on the documentation page above.

In the example, it shows a "value" of 1.3862944 with the description: "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))". Under "details" it gives the following values for the fields: docFreq: 1.0, docCount: 5.0

Using the provided docFreq and docCount values, I compute this as: log(1 + (5.0 - 1.0 + 0.5) / (1.0 + 0.5)) = 0.602, which is not the same as the 1.3862944 in the example.

I can't get any of the values to match up.

Am I reading it incorrectly?

Below is the entire example from the documentation:

GET /twitter/_doc/0/_explain   
{ 
  "query" : {
    "match" : { "message" : "elasticsearch" }
  }
}

This will yield the following result:

{
   "_index": "twitter",
   "_type": "_doc",
   "_id": "0",
   "matched": true,
   "explanation": {
       "value": 1.6943599,
       "description": "weight(message:elasticsearch in 0) [PerFieldSimilarity], result of:",
       "details": [
       {
        "value": 1.6943599,
        "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
        "details": [
           {
               "value": 1.3862944,  <== This is the one I am trying to reproduce
              "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
              "details": [
                 {
                    "value": 1.0,
                    "description": "docFreq",
                    "details": []
                 },
                 {
                    "value": 5.0,
                    "description": "docCount",
                    "details": []
                  }
               ]
           },
            {
              "value": 1.2222223,
              "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
              "details": [
                 {
                    "value": 1.0,
                    "description": "termFreq=1.0",
                    "details": []
                 },
                 {
                    "value": 1.2,
                    "description": "parameter k1",
                    "details": []
                 },
                 {
                    "value": 0.75,
                    "description": "parameter b",
                    "details": []
                 },
                 {
                    "value": 5.4,
                    "description": "avgFieldLength",
                    "details": []
                 },
                 {
                    "value": 3.0,
                    "description": "fieldLength",
                    "details": []
                 }
              ]
           }
        ]
     }
  ]
}
}

Solution

  • The explanation is in fact accurate; let me help you understand those calculations:

    This is the initial formula:

    log(1 + (5.0 - 1.0 + 0.5) / (1.0 + 0.5))
    

    Next step would be:

    log(1 + 4.5 / 1.5)
    

    One more:

    log(4) = ?
    

    And here comes the tricky part: you treated this as a base-10 logarithm. However, if you look at the code of the Lucene scorer, you will find that it is a natural logarithm (ln), and ln(4) is exactly the 1.3862944 from the example.
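    To see the difference directly, compare Java's `Math.log` (natural log) with `Math.log10` on the same value — a quick standalone check, not taken from the Elasticsearch docs:

```java
public class LogCheck {
    public static void main(String[] args) {
        // base-10 logarithm: this is what the 0.602 computation in the question used
        System.out.println(Math.log10(4)); // → 0.6020599913279624
        // natural logarithm (base e): this is what Lucene uses
        System.out.println(Math.log(4));   // → 1.3862943611198906
    }
}
```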

    The relevant idf method from Lucene's BM25Similarity:

    protected float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }
    

    where Math.log is documented as:

    public static double log(double a)
    
    Returns the natural logarithm (base e) of a double value.
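    Putting it all together, the final score 1.6943599 is the product of the idf and tfNorm values from the explain output. Here is a quick sketch (class and variable names are mine, not Lucene's) that plugs the numbers from the "details" sections into the two formulas shown in the descriptions:

```java
public class Bm25ScoreCheck {
    public static void main(String[] args) {
        // values taken from the "details" of the explain output above
        double docFreq = 1.0, docCount = 5.0;
        double freq = 1.0, k1 = 1.2, b = 0.75;
        double fieldLength = 3.0, avgFieldLength = 5.4;

        // idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) = ln(4)
        double idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));

        // tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
        double tfNorm = (freq * (k1 + 1))
                / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength));

        // the product matches the 1.6943599 from the explain output
        // (Lucene rounds each intermediate step to float precision)
        System.out.printf("idf=%.5f tfNorm=%.5f score=%.5f%n", idf, tfNorm, idf * tfNorm);
        // → idf=1.38629 tfNorm=1.22222 score=1.69436
    }
}
```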