I have an identifier string field in Elasticsearch that contains values like D123, M1, T23, etc.
I am trying to build autocomplete into the search for this field, such that a query of D12 might match D12, D120, D121, ..., D1210, etc.
Currently I have constructed a custom edge ngram filter and analyzer as such:
"filter": {
    "autocomplete_filter": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 10
    }
},
"analyzer": {
    "autocomplete": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": ["lowercase", "autocomplete_filter"]
    }
}
And in my mapping I use this on the identifier field when indexing:
"identifier": {
    "type": "string",
    "analyzer": "autocomplete",
    "search_analyzer": "standard"
}
This means the ngrams that are indexed for D1234 are D1, D12, D123 and D1234 (in lowercased form, since the lowercase filter runs before the edge ngram filter).
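To sanity-check which ngrams get indexed, here is a small Python sketch that mimics the analyzer on a single value. It only approximates the whitespace tokenizer plus the lowercase and edge ngram filters and is not Elasticsearch itself; edge_ngrams is a made-up helper name. Note that the tokens come out lowercased because of the lowercase filter.

```python
def edge_ngrams(text, min_gram=2, max_gram=10):
    """Approximate the autocomplete analyzer: whitespace split, lowercase, edge ngrams."""
    tokens = []
    for word in text.split():
        word = word.lower()
        # Prefixes longer than the word itself (or than max_gram) are not emitted.
        longest = min(len(word), max_gram)
        for n in range(min_gram, longest + 1):
            tokens.append(word[:n])
    return tokens


print(edge_ngrams("D1234"))  # ['d1', 'd12', 'd123', 'd1234']
```

Under these assumptions, a value shorter than min_gram (a single character, say) produces no tokens at all.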
To query this I am doing as follows:
"query": {
    "bool": {
        "should": {
            "match": {
                "identifier": {
                    "query": "D12",
                    "fuzziness": 0
                }
            }
        }
    }
}
This returns results from longest to shortest, so that D12 appears at the end of the results. How would I go about ensuring the shortest possible identifier has the highest relevance score?
My guess is that the D12 query is matching the ngrams like so: [{D12}, {D12}3, {D12}34], and Elasticsearch goes "Oh great, 3 matches!" rather than the 1 match, [{D12}], that the D12 result would give.
I guess one solution might be to not partially match those ngrams, so that Elasticsearch sees [{D12}] for both results but ranks D12 higher than D1234, since it matched 1/2 of the ngrams rather than 1/4. I'm not sure how to configure Elasticsearch to give this result, though.
Any help would be much appreciated.
You can do this with script-based sorting, but first you need to map your identifier field as a multi-field, like this:
"identifier": {
    "type": "string",
    "analyzer": "autocomplete",
    "search_analyzer": "standard",
    "fields": {
        "raw": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}
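You need to do this because if you sort directly on identifier, you will get the same results: that field is analyzed, so the script would see the indexed edge ngram tokens instead of the original value, and every matching document has a 2-letter token. With the raw sub-field in place, this query will give you the desired results: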
{
    "query": {
        "bool": {
            "should": {
                "match": {
                    "identifier": {
                        "query": "D12",
                        "fuzziness": 0
                    }
                }
            }
        }
    },
    "sort": {
        "_script": {
            "script": "doc['identifier.raw'].value.length()",
            "order": "asc",
            "type": "number"
        }
    }
}
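If it helps to see the intended ordering without a cluster, here is a rough Python sketch of what the query plus script sort should produce. It assumes single-word identifiers, so the ngram match behaves like a case-insensitive prefix match; autocomplete and the sample ids are made up for illustration, and this is not how Elasticsearch evaluates the query internally.

```python
def autocomplete(identifiers, query):
    """Emulate the match on edge ngrams (as a prefix test), then sort by raw length."""
    q = query.lower()
    hits = [ident for ident in identifiers if ident.lower().startswith(q)]
    # Mirrors sorting on doc['identifier.raw'].value.length() in ascending order.
    return sorted(hits, key=len)


ids = ["D1234", "D120", "M1", "D12", "D1210", "T23"]
print(autocomplete(ids, "D12"))  # ['D12', 'D120', 'D1234', 'D1210']
```

Keep in mind that the sort replaces relevance ordering entirely, so identifiers of equal length come back in no particular order unless you add a second sort key.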
Hope this helps!!