search elasticsearch autocomplete full-text-search search-engine

Autocomplete matching in Elasticsearch

I have an identifier string field in Elasticsearch that contains values like D123, M1, T23 etc.

I am trying to build autocomplete into the search for this field such that a query of D12 might match D12, D120, D121, ..., D1210 etc.

Currently I have constructed a custom edge ngram filter and analyzer as such:

"filter": {
  "autocomplete_filter": {
    "type": "edgeNGram",
    "min_gram": 2,
    "max_gram": 10
  }
}

"analyzer": {
  "autocomplete": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": ["lowercase", "autocomplete_filter"]
  }
}

And in my mapping I use this on the identifier field when indexing:

"identifier": {
  "type": "string",
  "analyzer": "autocomplete",
  "search_analyzer": "standard"
}

This means the ngrams that are indexed for D1234 are D1, D12, D123 and D1234 (stored lowercased as d1, d12, d123 and d1234 because of the lowercase filter).
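
To double-check which tokens end up in the index, the analyzer chain can be imitated in a few lines of Python. This is only a rough sketch, not Elasticsearch's own code: the function name `edge_ngrams` is made up for illustration, and it assumes the whitespace tokenizer, the lowercase filter, and the min_gram 2 / max_gram 10 settings shown above.

```python
def edge_ngrams(text, min_gram=2, max_gram=10):
    """Roughly imitate: whitespace tokenizer -> lowercase -> edge n-gram filter."""
    tokens = []
    for word in text.split():      # whitespace tokenizer
        word = word.lower()        # lowercase filter
        # Emit prefixes of length min_gram..max_gram, capped at the word length
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens


print(edge_ngrams("D1234"))  # ['d1', 'd12', 'd123', 'd1234']
print(edge_ngrams("D12"))    # ['d1', 'd12']
```

Both D12 and D1234 produce the token d12 exactly once, so a query for D12 matches each of them on a single term.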

I query the field as follows:

"query": {
  "bool": {
    "should": {
      "match": {
        "identifier": {
          "query": "D12",
          "fuzziness": 0
        }
      }
    }
  }
}

This returns results from longest to shortest, so that D12 appears at the end of the results. How would I go about ensuring the shortest possible identifier has the highest relevance score?

My guess is that the D12 query is matching the ngrams like so: [{D12}, {D12}3, {D12}34] and Elasticsearch goes "Oh great, 3 matches!" rather than the 1 [{D12}] that the D12 result would give.

I guess one solution might be not partially matching those ngrams, so that Elasticsearch sees [{D12}] for both results but ranks D12 higher than D1234 since it matched 1/2 of the ngrams rather than 1/4. I'm not sure how to configure Elasticsearch to give this result though.

Any help would be much appreciated.

Solution

You can do this with script-based sorting, but first you need to map your identifier field as a multi-field, like this:

"identifier": {
    "type": "string",
    "analyzer": "autocomplete",
    "search_analyzer": "standard",
    "fields": {
        "raw": {
            "type": "string",
            "index": "not_analyzed"
        }
    }
}

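You need to do this because sorting directly on identifier gives the same result for every hit: because of the edge ngram filter, the analyzed tokens of every document start with the same 2-letter token. The not_analyzed raw sub-field keeps the original value intact, so its length can be used for sorting (documents indexed before the mapping change have to be reindexed for raw to be populated). After that, this query will give you the desired results: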

{
  "query": {
    "bool": {
      "should": {
        "match": {
          "identifier": {
            "query": "D12",
            "fuzziness": 0
          }
        }
      }
    }
  },
  "sort": {
    "_script": {
      "script": "doc['identifier.raw'].value.length()",
      "order": "asc",
      "type": "number"
    }
  }
}
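
The effect of the script sort can be imitated outside Elasticsearch. The Python sketch below is only an illustration under assumptions: `analyzed_tokens` is a made-up helper standing in for the autocomplete analyzer, and the sample identifiers are invented. It shows that the first analyzed token is the same for every hit, so it cannot separate them, while the length of the untouched raw value can.

```python
def analyzed_tokens(identifier, min_gram=2, max_gram=10):
    # Hypothetical stand-in for the "autocomplete" analyzer (lowercase + edge n-grams)
    word = identifier.lower()
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


hits = ["D1210", "D121", "D12", "D120"]  # invented sample hits for the query "D12"

# The first token of the analyzed field is identical for every hit
first_tokens = {analyzed_tokens(h)[0] for h in hits}
print(first_tokens)

# Sort on the length of the raw value, like the script with "order": "asc"
by_length = sorted(hits, key=len)
print(by_length)
```

Identifiers of equal length keep their incoming order here because Python's sort is stable; how Elasticsearch orders such ties is not covered by this sketch. If the order among ties matters, a second sort criterion can be added after the script sort.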

Hope this helps!!