Tags: search, elasticsearch, autocomplete, full-text-search, search-engine

Autocomplete matching in Elasticsearch


I have an identifier string field in Elasticsearch that contains values like D123, M1, T23, etc.

I am trying to build autocomplete into the search for this field such that a query of D12 might match D12, D120, D121, ..., D1210 etc.

Currently I have constructed a custom edge ngram filter and analyzer as such:

"filter": {
  "autocomplete_filter": {
    "type": "edgeNGram",
    "min_gram": 2,
    "max_gram": 10
  }
}

"analyzer": {
  "autocomplete": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": ["lowercase", "autocomplete_filter"]
  }
}

And in my mapping I use this on the identifier field when indexing:

"identifier": {
  "type": "string",
  "analyzer": "autocomplete",
  "search_analyzer": "standard"
}

This means the edge ngrams indexed for D1234 are D1, D12, D123 and D1234 (stored in lowercase as d1, d12, d123 and d1234 because of the lowercase filter).
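To make the indexed terms concrete, here is a rough Python sketch of what the analyzer chain above does to a value. It is only an approximation written for illustration, not Elasticsearch code: the function names `edge_ngrams` and `autocomplete_analyze` are made up, and it assumes tokens shorter than `min_gram` produce no terms.

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes of `token` with lengths from min_gram to max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def autocomplete_analyze(text, min_gram=2, max_gram=10):
    """Approximate the 'autocomplete' analyzer: whitespace split, lowercase, edge ngrams."""
    terms = []
    for token in text.split():
        terms.extend(edge_ngrams(token.lower(), min_gram, max_gram))
    return terms


print(autocomplete_analyze("D1234"))  # ['d1', 'd12', 'd123', 'd1234']
print(autocomplete_analyze("M1"))     # ['m1']
```

The query string D12 is lowercased to d12 by the standard search analyzer, so it matches any document whose term list contains d12.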

To query this I am doing as follows:

"query": {
  "bool": {
    "should": {
      "match": {
        "identifier": {
          "query": "D12",
          "fuzziness": 0
        }
      }
    }
  }
}

This returns results from longest to shortest, so that D12 appears at the end of the results. How would I go about ensuring the shortest possible identifier has the highest relevance score?

My guess is that the D12 query is matching the ngrams like so: [{D12}, {D12}3, {D12}34], and Elasticsearch goes "Oh great, 3 matches!" rather than the single match [{D12}] that the D12 result would give.

I guess one solution might be to not partially match those ngrams, so that Elasticsearch sees [{D12}] for both results but ranks D12 higher than D1234 because it matched 1/2 of the ngrams rather than 1/4. I'm not sure how to configure Elasticsearch to give this result, though.

Any help would be much appreciated.


Solution

  • You can do this with script-based sorting, but first you need to map your identifier field as a multi-field like this:

    "identifier": {
        "type": "string",
        "analyzer": "autocomplete",
        "search_analyzer": "standard",
        "fields": {
            "raw": {
                "type": "string",
                "index": "not_analyzed"
            }
        }
    }
    

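    You need to do this because sorting directly on identifier would use the analyzed tokens rather than the original value: because of the edge ngram filter, every identifier is indexed with a 2-character token, so all the results would sort the same. The not_analyzed identifier.raw sub-field keeps the original string, so its length can be used for sorting. With that mapping in place, this query will give you the desired results: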

    {
      "query": {
        "bool": {
          "should": {
            "match": {
              "identifier": {
                "query": "D12",
                "fuzziness": 0
              }
            }
          }
        }
      },
      "sort": {
        "_script": {
          "script": "doc['identifier.raw'].value.length()",
          "order": "asc",
          "type": "number"
        }
      }
    }
    

    Hope this helps!!
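For anyone who wants to see the effect of the length-based sort without a cluster, here is a small Python sketch. It only imitates the behaviour described above; the sample identifiers and the helper `matches_prefix_query` are hypothetical, and real Elasticsearch matching and tie-breaking may differ.

```python
def matches_prefix_query(identifier, query, min_gram=2, max_gram=10):
    """True if the lowercased query equals one of the identifier's edge ngrams."""
    term = identifier.lower()
    upper = min(len(term), max_gram)
    grams = {term[:n] for n in range(min_gram, upper + 1)}
    return query.lower() in grams


# Hypothetical sample data, in arbitrary index order.
identifiers = ["D1210", "D123", "D12", "M1", "D120", "T23"]

hits = [i for i in identifiers if matches_prefix_query(i, "D12")]

# Counterpart of sorting on doc['identifier.raw'].value.length() ascending.
# sorted() is stable, so equal-length hits keep their original relative order.
ranked = sorted(hits, key=len)
print(ranked)  # ['D12', 'D123', 'D120', 'D1210']
```

Note that this sort only orders hits by length. The order among identifiers of equal length (D123 and D120 here) is decided by whatever tie-breaking applies, so a secondary sort key may be worth adding if that order matters.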