Tags: elasticsearch, filter, wildcard, uuid, n-gram

Elasticsearch ngram tokenizer performance for UUIDs


I would like to partially filter on UUID, reference_id, and postal_code. For reference_id and postal_code, I know they will be shorter than 36 characters, but UUIDs are exactly 36 characters long. I'm thinking of setting up an ngram tokenizer with:

min ngram 1

max ngram 36

Will this get really bad over time in terms of speed and memory? Is there a better way to partially search UUIDs? For example, I have 7e222584-0818-49b0-875b-2774f4bf939b and I want to be able to find it by searching for 9b0.


Solution

  • Yes, that will create an awful lot of tokens: 36 + 35 + 34 + 33 + ... + 1 = (1 + 36) * (36 / 2) = 666 tokens for each UUID, which is discouraged. Moreover, the default maximum allowed difference between min_gram and max_gram is 1, so you'd have to override index.max_ngram_diff in the index settings, which is a first hint that this might not be the right thing to do.
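    To see that blow-up concretely, here is a small Python sketch (just an illustration, not part of the Elasticsearch setup) that generates every ngram of length 1 to 36 from the example UUID, the same tokens an ngram filter with those bounds would emit:

    ```python
    # Generate all ngrams of length 1..36 from the example UUID,
    # mirroring what an ngram filter with min_gram=1 / max_gram=36 emits.
    uuid = "7e222584-0818-49b0-875b-2774f4bf939b"

    tokens = [
        uuid[i:i + n]
        for n in range(1, len(uuid) + 1)   # ngram lengths 1..36
        for i in range(len(uuid) - n + 1)  # every start position
    ]

    print(len(tokens))      # 666 tokens for a single UUID
    print("9b0" in tokens)  # True: the partial match asked about
    ```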

    You might want to give the new wildcard field type a try, as it might do a better job here.

    You can easily compare both approaches by creating two indexes, indexing the same (substantial) number of UUIDs into each, and then comparing their sizes.

    First index with ngrams:

    PUT uuid1
    {
      "settings": {
        "index.max_ngram_diff": 36,
        "analysis": {
          "analyzer": {
            "uuid": {
              "tokenizer": "keyword",
              "filter": [
                "ngram"
              ]
            }
          },
          "filter": {
            "ngram": {
              "type": "ngram",
              "min_gram": 1,
              "max_gram": 36
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "uuid": {
            "type": "text",
            "analyzer": "uuid",
            "search_analyzer": "standard"
          }
        }
      }
    }
    

    Second index with wildcard:

    PUT uuid2
    {
      "mappings": {
        "properties": {
          "uuid": {
            "type": "wildcard"
          }
        }
      }
    }
    

    Then you index the same data in both:

    POST _bulk
    { "index": {"_index": "uuid1"}}
    { "uuid": "7e222584-0818-49b0-875b-2774f4bf939b"}
    { "index": {"_index": "uuid2"}}
    { "uuid": "7e222584-0818-49b0-875b-2774f4bf939b"}
    

    And finally you can compare their sizes: the uuid1 index ends up bigger than the uuid2 index, here by a factor of 3, but you might want to index a bit more data to get a more reliable ratio:

    GET _cat/shards/uuid*?v
    
    index shard prirep state   docs  store ip          node
    uuid1 0     p      STARTED    1 10.6kb 10.0.33.86  instance-0000000062
    uuid2 0     p      STARTED    1  3.5kb 10.0.12.26  instance-0000000042
    

    Searching the second index through the wildcard field is straightforward, just as simple as the match query you'd run on the ngram index:

    POST uuid2/_search
    {
      "query": {
        "wildcard": {
          "uuid": "*9b0*"
        }
      }
    }
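
    For intuition, the wildcard field works roughly in two phases: it indexes character trigrams of each value for fast candidate filtering, then verifies candidates against the full stored value. The Python sketch below illustrates that two-phase idea; the names and structure are illustrative only, not Elasticsearch's actual implementation:

    ```python
    from fnmatch import fnmatchcase

    # Toy two-phase lookup in the spirit of the wildcard field:
    # phase 1 prunes by trigram, phase 2 verifies the full pattern.
    docs = [
        "7e222584-0818-49b0-875b-2774f4bf939b",
        "11111111-2222-3333-4444-555555555555",
    ]

    def trigrams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def wildcard_search(pattern, values):
        # Phase 1: every literal trigram in the pattern must occur in the value.
        needles = trigrams(pattern.strip("*"))
        candidates = [v for v in values if needles <= trigrams(v)]
        # Phase 2: verify candidates against the actual wildcard pattern.
        return [v for v in candidates if fnmatchcase(v, pattern)]

    print(wildcard_search("*9b0*", docs))
    # ['7e222584-0818-49b0-875b-2774f4bf939b']
    ```

    The cheap trigram pass discards most non-matching values before the expensive pattern verification runs, which is why the wildcard field stays fast without materializing every possible substring the way ngrams do.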