elasticsearch, elasticsearch-dsl

Why do elasticsearch queries require a certain number of characters to return results?


It seems like there is a minimum number of characters needed to get results from Elasticsearch for a specific property I am searching. The property is called 'guid' and has the following mapping:

    "guid": {
        "type": "text",
        "fields": {
            "keyword": {
                "type": "keyword",
                "ignore_above": 256
            }
        }
    }

I have a document with the following GUID: 3e49996c-1dd8-4230-8f6f-abe4236a6fc4

The following query returns the document as expected:

{"match":{"query":"9996c-1dd8*","fields":["guid"]}}

However, this query does not:

{"match":{"query":"9996c-1dd*","fields":["guid"]}}

I get the same result with multi_match and query_string queries. I haven't been able to find anything in the documentation about a character minimum, so what is happening here?


Solution

  • Elasticsearch does not require a minimum number of characters. What matters are the tokens generated by the analyzer.

    A helpful exercise is to use the _analyze API to see the tokens in your index.

    GET index_001/_analyze
    {
      "field": "guid",
      "text": [
        "3e49996c-1dd8-4230-8f6f-abe4236a6fc4"
      ]
    }
    

    Since the mapping does not specify an analyzer, the guid text field uses the standard analyzer, which splits the GUID on its hyphens. For the term 3e49996c-1dd8-4230-8f6f-abe4236a6fc4, the generated tokens look like this:

     "tokens" : [
        {
          "token" : "3e49996c",
          "start_offset" : 0,
          "end_offset" : 8,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "1dd8",
          "start_offset" : 9,
          "end_offset" : 13,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "4230",
          "start_offset" : 14,
          "end_offset" : 18,
          "type" : "<NUM>",
          "position" : 2
        },
        {
          "token" : "8f6f",
          "start_offset" : 19,
          "end_offset" : 23,
          "type" : "<ALPHANUM>",
          "position" : 3
        },
        {
          "token" : "abe4236a6fc4",
          "start_offset" : 24,
          "end_offset" : 36,
          "type" : "<ALPHANUM>",
          "position" : 4
        }
      ]
    }
    

    When you search, the same analyzer that was used at index time is applied to the query text. This is what happens when the search term "9996c-1dd8*" is analyzed:

    GET index_001/_analyze
    {
      "field": "guid",
      "text": [
        "9996c-1dd8*"
      ]
    }
    

    The generated tokens are:

    {
      "tokens" : [
        {
          "token" : "9996c",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "1dd8",
          "start_offset" : 6,
          "end_offset" : 10,
          "type" : "<ALPHANUM>",
          "position" : 1
        }
      ]
    }
    

    Note that the inverted index contains the token 1dd8, and the search term "9996c-1dd8*" also produces the token 1dd8, so the match takes place. (The query's other token, 9996c, matches nothing, because the index holds 3e49996c, but by default only one query token needs to match.)
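
    To see this end-to-end, here is a minimal search sketch (assuming the index name index_001 from the _analyze calls above; the trailing * is omitted because, as the output above shows, the analyzer discards it anyway):

    GET index_001/_search
    {
      "query": {
        "match": {
          "guid": "9996c-1dd8"
        }
      }
    }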

    When you test with the term "9996c-1dd*", no tokens match, so there are no results.

    GET index_001/_analyze
    {
      "field": "guid",
      "text": [
        "9996c-1dd*"
      ]
    }
    

    Tokens:

    {
      "tokens" : [
        {
          "token" : "9996c",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "1dd",
          "start_offset" : 6,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1
        }
      ]
    }
    

    Token "1dd" is not equal to "1dd8".