Tags: elasticsearch, partial, scoring, exact-match

Elasticsearch: Partial/Exact Scoring with edge_ngram & fuzziness


In Elasticsearch I am trying to get sensible scoring when using edge_ngram together with fuzziness. I would like exact matches to have the highest score and partial matches to have lower scores. Below are my setup and scoring results.

settings: {
          number_of_shards: 1,
          analysis: {
             filter: {
                ngram_filter: {
                   type: 'edge_ngram',
                   min_gram: 2,
                   max_gram: 20
                }
             },
             analyzer: {
                ngram_analyzer: {
                   type: 'custom',
                   tokenizer: 'standard',
                   filter: [
                      'lowercase',
                      'ngram_filter'
                   ]
                }
             }
          }
       },
    mappings: [{
          name: 'voter',
          _all: {
                'type': 'string',
                'index_analyzer': 'ngram_analyzer',
                'search_analyzer': 'standard'
             },
             properties: {
                last: {
                   type: 'string',
                   required : true,
                   include_in_all: true,
                   term_vector: 'yes',
                   index_analyzer: 'ngram_analyzer',
                   search_analyzer: 'standard'
                },
                first: {
                   type: 'string',
                   required : true,
                   include_in_all: true,
                   term_vector: 'yes',
                   index_analyzer: 'ngram_analyzer',
                   search_analyzer: 'standard'
                },

             }

       }]

After POSTing a document with the first name "Michael", I run the query below, substituting "Michael", "Michae", "Micha", "Mich", "Mic", and "Mi" in turn as the query text.

GET voter/voter/_search
{
 "query": {
    "match": {
      "_all": {
        "query": "Michael",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}

My score results are:

-"Michael": 0.19535106
-"Michae": 0.2242768
-"Micha": 0.24513611
-"Mich": 0.22340237
-"Mic": 0.21408978
-"Mi": 0.15438235

As you can see, the scores are not what I expected. I would like "Michael" to have the highest score and "Mi" to have the lowest.

Any help would be appreciated!


Solution

  • One way to approach this problem is to add a raw (non-ngram) version of each field to your mapping, like this:

                       last: {
                           type: 'string',
                           required : true,
                           include_in_all: true,
                           term_vector: 'yes',
                           index_analyzer: 'ngram_analyzer',
                           search_analyzer: 'standard',
                           fields: {
                                raw: {
                                   type: 'string'  // no analyzer set, so the standard analyzer is used
                                }
                           }
                        },
                        first: {
                           type: 'string',
                           required : true,
                           include_in_all: true,
                           term_vector: 'yes',
                           index_analyzer: 'ngram_analyzer',
                           search_analyzer: 'standard',
                           fields: {
                                raw: {
                                   type: 'string'  // no analyzer set, so the standard analyzer is used
                                }
                           }
                        },
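
    To see why the raw sub-field helps, it can be useful to look at the terms the ngram_analyzer writes to the index. The sketch below is a rough JavaScript imitation of that analyzer, not Elasticsearch code: edgeNgrams and analyzeWithNgrams are hypothetical helpers, and splitting on whitespace only approximates the standard tokenizer.

```javascript
// Hypothetical helper mimicking ngram_filter (type: 'edge_ngram', min_gram: 2, max_gram: 20).
function edgeNgrams(token, minGram = 2, maxGram = 20) {
  const grams = [];
  const longest = Math.min(maxGram, token.length);
  // Emit every prefix of the token from minGram up to the longest allowed length.
  for (let len = minGram; len <= longest; len++) {
    grams.push(token.slice(0, len));
  }
  return grams;
}

// Rough stand-in for tokenizer: 'standard' + filter: ['lowercase', 'ngram_filter'].
// Assumption: whitespace splitting is close enough to the standard tokenizer for simple names.
function analyzeWithNgrams(text) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .filter(Boolean)
    .flatMap(token => edgeNgrams(token));
}

console.log(analyzeWithNgrams('Michael'));
// [ 'mi', 'mic', 'mich', 'micha', 'michae', 'michael' ]
```

    Because every prefix from "mi" to "michael" is an indexed term, a query for "Mich" is an exact term match on the ngram field just as "Michael" is, so that field alone cannot tell a full-name match from a prefix match. The raw sub-field only holds the whole token "michael".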
    

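    You could also make the raw field an exact match by setting index: 'not_analyzed' on it. The value is then indexed verbatim as a single term, so matching on it becomes case-sensitive.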

    Then you can query like this:

    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "_all": {
                  "query": "Michael",
                  "fuzziness": 2,
                  "prefix_length": 1
                }
              }
            },
            {
              "match": {
                "last.raw": {
                  "query": "Michael",
                  "boost": 5
                }
              }
            },
            {
              "match": {
                "first.raw": {
                  "query": "Michael",
                  "boost": 5
                }
              }
            }
          ]
        }
      }
    }
    

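    Documents that match more clauses are scored higher, so a document whose first.raw or last.raw matches the full name outranks one that only matches through _all. You can adjust the boost values to suit your requirements.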
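
    If the search text comes from user input, the same query body can be generated rather than written by hand. This is only a sketch: buildNameQuery is a hypothetical helper name, and the default boost of 5 is simply the value used in the query above, not a recommendation.

```javascript
// Hypothetical helper: builds the bool/should query shown above for any search text.
function buildNameQuery(name, exactBoost = 5) {
  // Boosted clause against a raw (non-ngram) sub-field.
  const exactClause = field => ({
    match: { [field]: { query: name, boost: exactBoost } }
  });

  return {
    query: {
      bool: {
        should: [
          // Fuzzy match against the edge-ngram-analyzed _all field.
          { match: { _all: { query: name, fuzziness: 2, prefix_length: 1 } } },
          exactClause('last.raw'),
          exactClause('first.raw')
        ]
      }
    }
  };
}

// Print the request body that would be sent to GET voter/voter/_search.
console.log(JSON.stringify(buildNameQuery('Michael'), null, 2));
```

    Raising exactBoost widens the gap between full-name matches and prefix matches; lowering it lets the fuzzy _all clause contribute more to the ranking.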