
Tokenize a string based on a delimiter in Elasticsearch


I need to tokenize the string 36-3031.00|36-3021.00 into 36-3031.00 and 36-3021.00 using the | delimiter.

I have tried it like this:

PUT text
{
  "test1": {
    "settings": {
      "analysis": {
        "tokenizer": {
          "pipe_tokenizer": {
            "type": "pattern",
            "pattern": "|"
          }
        },
        "analyzer": {
          "pipe_analyzer": {
            "type": "custom",
            "tokenizer": "pipe_tokenizer"
          }
        }
      }
    },
    "mappings": {
      "mytype": {
        "properties": {
          "text": {
            "type": "string",
            "analyzer": "pipe_analyzer"
          }
        }
      }
    }
  }
}

But it doesn't produce the expected tokens. Can anyone sort out this use case?


Solution

  • The following is the correct mapping to use: the index name belongs in the REST PUT command itself (not nested inside the request body), and the | character needs to be escaped in the pattern, since the pattern tokenizer interprets it as a regular expression and a bare | means alternation:

    DELETE test1
    PUT test1
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "pipe_tokenizer": {
              "type": "pattern",
              "pattern": "\\|"
            }
          },
          "analyzer": {
            "pipe_analyzer": {
              "type": "custom",
              "tokenizer": "pipe_tokenizer"
            }
          }
        }
      },
      "mappings": {
        "mytype": {
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "pipe_analyzer"
            }
          }
        }
      }
    }
    
    POST /test1/mytype/1
    {"text":"36-3031.00|36-3021.00"}
    
    GET /test1/_analyze
    {"field":"text","text":"36-3031.00|36-3021.00"}