elasticsearch, special-characters, tokenize

How to tokenize and search with special characters in Elasticsearch


I need text like #tag1 quick brown fox #tag2 to be tokenized into #tag1, quick, brown, fox, #tag2, so that I can search this text on any of the terms #tag1, quick, brown, fox, #tag2, where the symbol # must be included in the search term. In my index mapping I have a text type field (to search on quick, brown, fox) with a keyword type subfield (to search on #tag1), but when I search for #tag1 it gives me only a match on the first token, tag1, and not on #tag1. I think what I need is a tokenizer that produces word-boundary tokens that include special characters. Can someone suggest a solution?
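
For reference, the mapping I describe is roughly the following (the index and field names here are just placeholders):

    PUT my-index
    {
      "mappings": {
        "properties": {
          "my_text": {
            "type": "text",
            "fields": {
              "raw": { "type": "keyword" }
            }
          }
        }
      }
    }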


Solution

  • If you want to include # in your search, you should use a different analyzer than the standard analyzer, because # is stripped during the analysis phase. You can use the whitespace analyzer for your text field. For the search itself, you can use a wildcard pattern:

    Query:

    GET [Your index name]/_search
    {
      "query": {
        "wildcard": {
          "[FieldName]": {
            "value": "#tag*"
          }
        }
      }
    }
    

    You can find more information about the Elasticsearch built-in analyzers in the official documentation.
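
    A mapping that applies the whitespace analyzer to the text field could look roughly like this (index and field names are placeholders):

    PUT my-index
    {
      "mappings": {
        "properties": {
          "my_text": {
            "type": "text",
            "analyzer": "whitespace"
          }
        }
      }
    }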

    UPDATE:

    Whitespace analyzer:

    POST /_analyze
    {
      "analyzer": "whitespace",
      "text": "#tag1 quick #tag2"
    }
    

    Result:

    {
      "tokens" : [
        {
          "token" : "#tag1",
          "start_offset" : 0,
          "end_offset" : 5,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "quick",
          "start_offset" : 6,
          "end_offset" : 11,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "#tag2",
          "start_offset" : 12,
          "end_offset" : 17,
          "type" : "word",
          "position" : 2
        }
      ]
    }
    

    As you can see, #tag1 and #tag2 are kept as two separate tokens.
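
    With the field analyzed this way, a plain match query for #tag1 should now find the document, because #tag1 is stored as its own token (again, index and field names are placeholders):

    GET my-index/_search
    {
      "query": {
        "match": {
          "my_text": "#tag1"
        }
      }
    }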