regex, elasticsearch, tokenize

Pattern Tokenizer for extracting file name


I want to tokenize "a.b.c" into the parts a, a.b, a.b.c, b.c, b, and c in Elasticsearch. I tried a few regexes, but updating the tokenizer each time is tedious and I'm not good at regex, so I'm asking for help.

I already tried these patterns, but they didn't give me what I want:

[(^\\.)]+
[(.+\\.)]+
[^\\p{L}\\d]+

Solution

  • Try the path_hierarchy tokenizer with "." as the delimiter:

    PUT my_sample
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "path_hierarchy",
              "delimiter": ".",
              "replacement": "."
            }
          }
        }
      }
    }
    

    Then run the analyzer on a sample string:

    POST my_sample/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "a.b.c"
    }
    

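    It produces the following terms:

    [ a, a.b, a.b.c ]
    

    Setting "reverse": true on the tokenizer yields [ a.b.c, b.c, c ] instead. Neither mode emits the inner segment b on its own, so you simply handle the remaining parts in your program.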
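One way to do the in-program step is to generate every contiguous run of dot-separated segments on the client side and index the result yourself, for example into a keyword field. The following is a minimal Python sketch under that assumption. The function name dotted_subpaths is hypothetical and is not part of Elasticsearch or any client library.

```python
def dotted_subpaths(name: str, sep: str = ".") -> list[str]:
    """Return every contiguous run of segments of `name`, re-joined with `sep`.

    Hypothetical helper for illustration; not part of any Elasticsearch API.
    """
    # Drop empty segments so a leading or trailing separator is ignored.
    parts = [p for p in name.split(sep) if p]
    return [
        sep.join(parts[start:end])
        for start in range(len(parts))
        for end in range(start + 1, len(parts) + 1)
    ]


print(dotted_subpaths("a.b.c"))
# ['a', 'a.b', 'a.b.c', 'b', 'b.c', 'c']
```

The number of parts grows quadratically with the number of segments (n * (n + 1) / 2), which is fine for file names but worth keeping in mind for very long dotted strings.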