I'm trying to build an efficient auto-complete search input on my website, to search cities. I assume that people will always start typing their city name from the beginning, with the words in the right order.
E.g. a user who lives in Saint-Maur will type sai.. but will never type mau.. first.
I need to boost the score of a result when it starts with the query term. E.g. if a user types pari, the city Parigné-le-Pôlin should score higher than Fontenay-en-Parisis, since it starts with pari.
I'm using an edge n-gram filter, and a phrase match because the order of words matters. I'm sure that my problem has a simple solution, but I'm a newbie in the magic world of ES :)
Here is my mapping:
{
  "settings": {
    "index": {
      "number_of_shards": 1
    },
    "analysis": {
      "analyzer": {
        "partialPostalCodeAnalyzer": {
          "tokenizer": "standard",
          "filter": ["partialFilter"]
        },
        "partialNameAnalyzer": {
          "tokenizer": "standard",
          "filter": ["asciifolding", "lowercase", "word_delimiter", "partialFilter"]
        },
        "searchAnalyzer": {
          "tokenizer": "standard",
          "filter": ["asciifolding", "lowercase", "word_delimiter"]
        }
      },
      "filter": {
        "partialFilter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 50
        }
      }
    }
  },
  "mappings": {
    "village": {
      "properties": {
        "postalCode": {
          "type": "string",
          "index_analyzer": "partialPostalCodeAnalyzer",
          "search_analyzer": "searchAnalyzer"
        },
        "name": {
          "type": "string",
          "index_analyzer": "partialNameAnalyzer",
          "search_analyzer": "searchAnalyzer"
        },
        "population": {
          "type": "integer",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Some sample data:
PUT /tv_village/village/1 {"name": "Paris"}
PUT /tv_village/village/2 {"name": "Parigny"}
PUT /tv_village/village/3 {"name": "Fontenay-en-Parisis"}
PUT /tv_village/village/4 {"name": "Parigné-le-Pôlin"}
If I perform this query, you can see that the results are not in the order I want them to be (I want the 4th result to come before the 3rd one):
GET /tv_village/village/_search
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  }
}
Results:
"hits": [
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "1",
    "_score": 0.7768564,
    "_source": {
      "name": "Paris"
    }
  },
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "2",
    "_score": 0.7768564,
    "_source": {
      "name": "Parigny"
    }
  },
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "3",
    "_score": 0.3884282,
    "_source": {
      "name": "Fontenay-en-Parisis"
    }
  },
  {
    "_index": "tv_village",
    "_type": "village",
    "_id": "4",
    "_score": 0.3884282,
    "_source": {
      "name": "Parigné-le-Pôlin"
    }
  }
]
In your mapping definition, put another analyzer:
"keywordLowercaseAnalyer": {
  "tokenizer": "keyword",
  "filter": ["lowercase"]
}
This means: keep the whole value intact as a single token (through the keyword tokenizer) and lowercase it (like "parigné-le-pôlin").
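To make the difference concrete, here is a rough Python model of what the two analyzers emit for a city name. It is only a sketch under assumptions: the helper names (ascii_fold, partial_name_tokens, keyword_lowercase_tokens) are made up for illustration, and the regex split and accent stripping only approximate what the standard tokenizer, asciifolding and word_delimiter really do.

```python
import re
import unicodedata


def ascii_fold(text):
    # Rough stand-in for the asciifolding filter: drop combining accents.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def partial_name_tokens(text, min_gram=1, max_gram=50):
    # Approximation of partialNameAnalyzer: fold, lowercase, split into
    # words, then emit the edge n-grams of every word.
    folded = ascii_fold(text).lower()
    tokens = []
    for word in re.findall(r"[a-z0-9]+", folded):
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


def keyword_lowercase_tokens(text):
    # Approximation of keywordLowercaseAnalyer: the whole value stays
    # one single token, only lowercased.
    return [text.lower()]


for name in ["Parigné-le-Pôlin", "Fontenay-en-Parisis"]:
    in_ngrams = "pari" in partial_name_tokens(name)
    is_prefix = keyword_lowercase_tokens(name)[0].startswith("pari")
    print(name, "| matches an edge n-gram:", in_ngrams, "| starts with pari:", is_prefix)
```

In this model both names contain the n-gram pari, so the n-gram field alone cannot tell them apart, while only the single lowercased token of Parigné-le-Pôlin starts with pari.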
Then define two more sub-fields for your name field: raw, which should be not_analyzed, and raw_lowercase, which should use keywordLowercaseAnalyer:
"name": {
  "type": "string",
  "index_analyzer": "partialNameAnalyzer",
  "search_analyzer": "searchAnalyzer",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed"
    },
    "raw_lowercase": {
      "type": "string",
      "analyzer": "keywordLowercaseAnalyer"
    }
  }
}
I'm doing this because you can have searches for "pari" or "Pari". In your query, use the rescore functionality to recompute the scoring based on an additional query:
{
  "query": {
    "match_phrase": {
      "name": "pari"
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "bool": {
          "should": [
            {"prefix": {"name.raw": "pari"}},
            {"prefix": {"name.raw_lowercase": "pari"}}
          ]
        }
      }
    }
  }
}
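To illustrate how the rescoring is expected to change the order, here is a small Python simulation using the scores from your results. It is a sketch, not what Elasticsearch computes exactly: it assumes the default combination (original score plus rescore score, both weights at 1), and the per_clause value of 1.0 for each matching prefix clause is a placeholder I picked for illustration. The function names are hypothetical too.

```python
# (name, _score) pairs copied from the results in the question.
hits = [
    ("Paris", 0.7768564),
    ("Parigny", 0.7768564),
    ("Fontenay-en-Parisis", 0.3884282),
    ("Parigné-le-Pôlin", 0.3884282),
]


def rescore_bonus(name, term, per_clause=1.0):
    # Hypothetical stand-in for the bool/should rescore query.
    # per_clause is an assumed score, not a value taken from Elasticsearch.
    bonus = 0.0
    if name.startswith(term):          # prefix on name.raw (untouched value)
        bonus += per_clause
    if name.lower().startswith(term):  # prefix on name.raw_lowercase
        bonus += per_clause
    return bonus


def rescore(hits, term, query_weight=1.0, rescore_query_weight=1.0):
    # Combine both scores by summing them, then sort best first.
    rescored = [
        (name, query_weight * score + rescore_query_weight * rescore_bonus(name, term))
        for name, score in hits
    ]
    return sorted(rescored, key=lambda hit: hit[1], reverse=True)


ranked = [name for name, _ in rescore(hits, "pari")]
print(ranked)
```

Under these assumptions Parigné-le-Pôlin moves ahead of Fontenay-en-Parisis, because only names that start with the term receive the extra score.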
There are two drawbacks, from your use case point of view and regarding the prefix query:
The prefix query is not analyzed, and this is the reason for adding those two raw* fields: one field holds a lowercased version, the other holds the untouched version, so that queries for "pari" or "Pari" are both covered.
I have two suggestions:
Use the window_size attribute of the rescore query to limit the number of hits the rescoring is performed on, thus improving performance.
For your reference, this is the documentation page for rescore.
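As a sketch of the window_size suggestion, the following Python snippet builds the same search body with window_size set at the rescore level and prints it as JSON. The helper name build_search_body and the value 50 are assumptions for illustration only; pick a window that fits how many suggestions you actually display.

```python
import json


def build_search_body(term, window_size=50):
    # Hypothetical helper: same query as above, plus window_size so that
    # only the top hits of each shard are rescored.
    prefix_clauses = [
        {"prefix": {"name.raw": term}},
        {"prefix": {"name.raw_lowercase": term}},
    ]
    return {
        "query": {"match_phrase": {"name": term}},
        "rescore": {
            "window_size": window_size,  # assumed value, tune for your data
            "query": {
                "rescore_query": {"bool": {"should": prefix_clauses}}
            },
        },
    }


body = build_search_body("pari")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same GET /tv_village/village/_search request used in the question.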