
Improve score if the field starts with the term


I'm trying to build an efficient auto-complete search input for my website, to search for cities. I assume that people will always start typing their city name from the beginning, with the words in the right order. E.g. a user who lives in Saint-Maur will type sai.. but will never type mau.. first.

I need to improve the score of a result if it starts with the term from the query. E.g. if a user types pari, the city Parigné-le-Pôlin should have a better score than Fontenay-en-Parisis, since it starts with pari.

I'm using an edge n-gram filter, and a phrase match because the order of words matters. I'm sure that my problem has a simple solution, but I'm a newb in the magic world of ES :)

Here is my mapping:

{
    "settings": {
        "index": {
            "number_of_shards": 1
        },

        "analysis": {
            "analyzer": {
                "partialPostalCodeAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["partialFilter"]
                },
                "partialNameAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["asciifolding", "lowercase", "word_delimiter", "partialFilter"]
                },
                "searchAnalyzer": {
                    "tokenizer": "standard",
                    "filter": ["asciifolding", "lowercase", "word_delimiter"]
                }
            },

            "filter": {
                "partialFilter": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 50
                }
            }
        }
    },

    "mappings": {
        "village": {
            "properties": {
                "postalCode": {
                    "type": "string",
                    "index_analyzer": "partialPostalCodeAnalyzer",
                    "search_analyzer": "searchAnalyzer"
                },

                "name": {
                    "type": "string",
                    "index_analyzer": "partialNameAnalyzer",
                    "search_analyzer": "searchAnalyzer"
                },

                "population": {
                    "type": "integer",
                    "index": "not_analyzed"
                }
            }
        }
    }
}

Some sample data:

PUT /tv_village/village/1 {"name": "Paris"}
PUT /tv_village/village/2 {"name": "Parigny"}
PUT /tv_village/village/3 {"name": "Fontenay-en-Parisis"}
PUT /tv_village/village/4 {"name": "Parigné-le-Pôlin"}

If I perform this query, you can see that the results are not in the order I want (I want the 4th result to come before the 3rd one):

GET /tv_village/village/_search
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  }
}

Results:

      "hits": [
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "1",
            "_score": 0.7768564,
            "_source": {
               "name": "Paris"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "2",
            "_score": 0.7768564,
            "_source": {
               "name": "Parigny"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "3",
            "_score": 0.3884282,
            "_source": {
               "name": "Fontenay-en-Parisis"
            }
         },
         {
            "_index": "tv_village",
            "_type": "village",
            "_id": "4",
            "_score": 0.3884282,
            "_source": {
               "name": "Parigné-le-Pôlin"
            }
         }
      ]
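
To see where the tie between documents 3 and 4 comes from, the same query can be rerun with "explain": true, which asks Elasticsearch to attach a score breakdown to every hit. A minimal sketch of the request body, sent to the same GET /tv_village/village/_search endpoint; the breakdown should show which factor (for example field-length normalization) separates the one-word names from the three-word ones:

```json
{
  "explain": true,
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  }
}
```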

Solution

  • In your mapping definition, put another analyzer:

                "keywordLowercaseAnalyer": {
                  "tokenizer": "keyword",
                  "filter": ["lowercase"]
                }
    

    This keeps the whole name intact as a single token (through the keyword tokenizer) and lowercases it (giving, e.g., "parigné-le-pôlin"). Then define two additional sub-fields for your name field:

    • one raw that should be not_analyzed
    • one raw_lowercase that should use keywordLowercaseAnalyer

      "name": {
        "type": "string",
        "index_analyzer": "partialNameAnalyzer",
        "search_analyzer": "searchAnalyzer",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          },
          "raw_lowercase": {
            "type": "string",
            "analyzer": "keywordLowercaseAnalyer"
          }
        }
      }
      

    I'm doing this because you can have searches for "pari" or "Pari". In your query, use the rescore functionality to recompute the scoring based on an additional query:

    {
      "query": {
        "match_phrase": {
          "name": "pari"
        }
      },
      "rescore": {
        "query": {
          "rescore_query": {
            "bool": {
              "should": [
                {"prefix": {"name.raw": "pari"}},
                {"prefix": {"name.raw_lowercase": "pari"}}
              ]
            }
          }
        }
      }
    }
    

    The prefix query has two drawbacks for your use case:

    • it is quite resource intensive
    • the value passed to a prefix query is not analyzed, which is the reason for adding those two raw* fields: one field holds the lowercased version, the other holds the untouched version, so that queries for "pari" or "Pari" are both covered.

    I have two suggestions:

    • test the query above on your real data to see how it behaves, performance-wise
    • play with the window_size attribute of the rescore query to limit the number of hits the rescoring is performed on, thus improving performance.
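
    As a sketch of the second suggestion, this is the same rescore request with window_size set. The value 50 is only an illustrative assumption, not a recommendation; it means the rescore query is applied to the top 50 hits on each shard, and the remaining hits keep their original score. Pick a value by testing on your real data:

    ```json
    {
      "query": {
        "match_phrase": {
          "name": "pari"
        }
      },
      "rescore": {
        "window_size": 50,
        "query": {
          "rescore_query": {
            "bool": {
              "should": [
                {"prefix": {"name.raw": "pari"}},
                {"prefix": {"name.raw_lowercase": "pari"}}
              ]
            }
          }
        }
      }
    }
    ```

    For an auto-complete box that only displays a handful of suggestions, a small window is usually enough, since only the hits near the top need to be reordered.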

    For your reference, this is the documentation page for rescore.