Elasticsearch: how do I control result scores in an ngram query?


I have hundreds of chemical records in my index climate_change.

I'm using ngram search, and these are the settings I'm using for the index:

{
  "settings": {
    "index.max_ngram_diff": 30,
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer": {
            "tokenizer": "test_ngram",
            "filter": [
              "lowercase"
            ]
          },
          "search_analyzer": {
            "tokenizer": "test_ngram",
            "filter": [
              "lowercase"
            ]
          }
        },
        "tokenizer": {
          "test_ngram": {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 30,
            "token_chars": [
              "letter",
              "digit"
            ]
          }
        }
      }
    }
  }
}

My main problem is that when I run a query like this one:

GET climate_change/_search?size=1000
{
  "query": {
    "match": {
      "description": {
        "query":"oxygen"
      }
    }
  }
}

I see that a lot of results have the same score, 7.381186, which seems strange:

{
  "_index" : "climate_change",
  "_type" : "_doc",
  "_id" : "XXX",
  "_score" : 7.381186,
  "_source" : {
    "recordtype" : "chemicals",
    "description" : "carbon/oxygen"
  }
},
{
  "_index" : "climate_change",
  "_type" : "_doc",
  "_id" : "YYY",
  "_score" : 7.381186,
  "_source" : {
    "recordtype" : "chemicals",
    "description" : "oxygen"
  }
}

How is this possible? In the example above, since I'm using ngram and searching for oxygen in the description field, I would expect the second result to score higher than the first. I've also tried setting the tokenizer type to "standard" and "whitespace" in the settings, but that didn't help. Could the '/' character in the description be the cause?

Thanks a lot!
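
To make the '/' question concrete, the analysis chain from the settings above can be approximated in plain Python. This is only a rough sketch: edge_ngrams is a hypothetical helper, not an Elasticsearch API, and it assumes that str.isalnum() is a close enough stand-in for the letter and digit token_chars. Under those assumptions, '/' simply acts as a word boundary, so "carbon/oxygen" produces edge n-grams for both "carbon" and "oxygen".

```python
def edge_ngrams(text, min_gram=1, max_gram=30):
    """Approximate the edge_ngram tokenizer plus lowercase filter from the settings.

    Hypothetical helper for illustration; not part of any Elasticsearch client.
    """
    tokens = []
    word = ""
    # A trailing space flushes the last word through the same code path.
    for ch in text + " ":
        if ch.isalnum():  # assumption: stands in for token_chars letter + digit
            word += ch
            continue
        # Any other character (such as '/') ends the current word.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n].lower())
        word = ""
    return tokens


query_terms = set(edge_ngrams("oxygen"))
for description in ("carbon/oxygen", "oxygen"):
    terms = edge_ngrams(description)
    matched = query_terms & set(terms)
    print(description, "->", len(terms), "terms,", len(matched), "shared with the query")
```

In this approximation both descriptions contain every n-gram of the query, and they differ only in how many terms the field holds in total. That difference can only influence the score if the analyzer is actually applied to the description field.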


Solution

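  • You also need to set the analyzer on the description field in the mapping. Defining an analyzer in the index settings only makes it available; a field uses it only if its mapping references it, and otherwise falls back to the index default, which is the standard analyzer unless configured otherwise.

    Here is a working example with the index settings and mapping, index data, search query, and search result: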

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "test_ngram",
              "filter": [
                "lowercase"
              ]
            },
            "search_analyzer": {
              "tokenizer": "test_ngram",
              "filter": [
                "lowercase"
              ]
            }
          },
          "tokenizer": {
            "test_ngram": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 30,
              "token_chars": [
                "letter",
                "digit"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "description": {
            "type": "text",
            "analyzer": "my_analyzer"
          }
        }
      }
    }
    

    Index Data:

    {
      "recordtype": "chemicals",
      "description": "carbon/oxygen"
    }
    {
      "recordtype": "chemicals",
      "description": "oxygen"
    }
    

    Search Query:

    {
      "query": {
        "match": {
          "description": {
            "query":"oxygen"
          }
        }
      }
    }
    

    Search Result:

    "hits": [
      {
        "_index": "67180160",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.89246297,
        "_source": {
          "recordtype": "chemicals",
          "description": "oxygen"
        }
      },
      {
        "_index": "67180160",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.6651374,
        "_source": {
          "recordtype": "chemicals",
          "description": "carbon/oxygen"
        }
      }
    ]
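
One way to check that the mapping took effect is to ask Elasticsearch which tokens it produces for the field, using the _analyze API with its field parameter. The sketch below only builds the request with the Python standard library. The base URL, the index name climate_change, and the helper build_analyze_request are assumptions for illustration, so adjust them to your cluster.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, field, text):
    """Build (but do not send) a POST to the index-level _analyze endpoint.

    Hypothetical helper; the analyzer is derived from the field's mapping.
    """
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: a local, unsecured node on the default port.
request = build_analyze_request(
    "http://localhost:9200", "climate_change", "description", "carbon/oxygen"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run it against a live cluster, uncomment:
# with urllib.request.urlopen(request) as response:
#     print([t["token"] for t in json.load(response)["tokens"]])
```

If the response lists only whole words rather than prefixes such as o, ox, oxy, the ngram analyzer is probably not attached to the description field.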