Elasticsearch changing similarity does not work


Changing the similarity algorithm of my index has no effect. I want to compare BM25 with TF-IDF, but I always get the same results. I'm using Elasticsearch 5.x.

I have tried everything I can think of: setting the similarity of a property to classic, setting it to BM25, and leaving it unset.

"properties": {
           "content": {
              "type": "text",
              "similarity": "classic"
           },

I also tried setting the default similarity of my index in the settings and using it in the properties:

"settings": {
     "index": {
        "number_of_shards": "5",
        "provided_name": "test",
        "similarity": {
           "default": {
              "type": "classic"
           }
        },
        "creation_date": "1493748517301",
        "number_of_replicas": "1",
        "uuid": "sNuWcT4AT82MKsfAB9JcXQ",
        "version": {
           "created": "5020299"
        }
     }

The query I'm testing looks something like this:

{
  "query": {
    "match": {
      "content": "some search query"
    }
  }
}

Solution

  • I have created a sample below:

    DELETE test
    PUT test
    {
      "mappings": {
        "book": {
          "properties": {
            "content": {
              "type": "text",
              "similarity": "BM25"
            },
            "subject": {
              "type": "text",
              "similarity": "classic"
            }
          }
        }
      }
    }
    
    POST test/book/1
    {
      "subject": "A neutron star is the collapsed core of a large (10–29 solar masses) star. Neutron stars are the smallest and densest stars known to exist.[1] Though neutron stars typically have a radius on the order of 10 km, they can have masses of about twice that of the Sun.",
      "content": "A neutron star is the collapsed core of a large (10–29 solar masses) star. Neutron stars are the smallest and densest stars known to exist.[1] Though neutron stars typically have a radius on the order of 10 km, they can have masses of about twice that of the Sun."
    }
    POST test/book/2
    {
      "subject": "A quark star is a hypothetical type of compact exotic star composed of quark matter, where extremely high temperature and pressure forces nuclear particles to dissolve into a continuous phase consisting of free quarks. These are ultra-dense phases of degenerate matter theorized to form inside neutron stars exceeding a predicted internal pressure needed for quark degeneracy.",
      "content": "A quark star is a hypothetical type of compact exotic star composed of quark matter, where extremely high temperature and pressure forces nuclear particles to dissolve into a continuous phase consisting of free quarks. These are ultra-dense phases of degenerate matter theorized to form inside neutron stars exceeding a predicted internal pressure needed for quark degeneracy."
    }
    
    GET test/_search?explain
    {
      "query": {
        "match": {
          "subject": "neutron"
        }
      }
    }
    GET test/_search?explain
    {
      "query": {
        "match": {
          "content": "neutron"
        }
      }
    }
    

    The subject and content fields use different similarity definitions, but in the two documents I indexed (text taken from Wikipedia) both fields contain the same text. If you run the two queries, the results have different scores, and the explanations contain something like this:

    • from the first query (subject, classic): "description": "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:"
    • from the second query (content, BM25): "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:"
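
To see why the two idf descriptions lead to different scores, here is a small Python sketch that evaluates both formulas for the same counts. The function names `classic_idf` and `bm25_idf` are made up for this illustration, and `doc_count = 2`, `doc_freq = 2` are assumed values (both sample documents contain "neutron"). Term statistics are normally computed per shard, so with five shards the numbers in your own explain output will probably differ.

```python
import math


def classic_idf(doc_count: int, doc_freq: int) -> float:
    """idf as quoted in the first explanation (classic / TF-IDF)."""
    return math.log((doc_count + 1) / (doc_freq + 1)) + 1.0


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """idf as quoted in the second explanation (BM25)."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # Assumed statistics: two documents in total, both containing the term.
    doc_count, doc_freq = 2, 2
    print("classic idf:", classic_idf(doc_count, doc_freq))
    print("BM25 idf:   ", bm25_idf(doc_count, doc_freq))
```

The idf is only one factor in the final score, but if the two fields report different idf formulas in their explanations, the per-field similarity setting is being applied.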