elasticsearch, elasticsearch-7

Elasticsearch tokenizer to keep (and concatenate) "and"


I am trying to make an Elasticsearch filter, analyzer, and tokenizer that can normalize searches like:

  • "henry&william book" -> "henrywilliam book"
  • "henry & william book" -> "henrywilliam book"
  • "henry and william book" -> "henrywilliam book"
  • "henry william book" -> "henry william book"

In other words, I would like to normalize my "and" and "&" queries, but also concatenate the words between them.

I'm thinking of making a tokenizer that breaks "henry & william book" into the tokens ["henry & william", "book"], and then making a character filter that performs the following replacements:

  • " & " -> ""
  • " and " -> ""
  • "&" -> ""

However, this feels a bit hackish. Is there a better way to do it?

The reason I can't just do this entirely in the analyzer/filter phase is that it runs too late. In my attempts, Elasticsearch has already broken "henry & william" into just ["henry", "william"] before my analyzer/filter runs.
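For example (on a stock index with the default standard analyzer), the _analyze API already shows the & being dropped and the names split apart:

    POST _analyze
    {
      "analyzer": "standard",
      "text": "henry & william book"
    }
    
    Results => tokens: ["henry", "william", "book"]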


Solution

  • You can use a clever mix of two character filters that kick in before the tokenizer. The first character filter maps and onto &, and the second gets rid of the & and glues the two neighboring tokens together. The same mix also lets you introduce other replacements, such as or and | (a sketch of that extension appears after the examples below).

    PUT test
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "and": {
              "type": "mapping",
              "mappings": [
                "and => &"
              ]
            },
            "&": {
              "type": "pattern_replace",
              "pattern": """(\w+)(\s*&\s*)(\w+)""",
              "replacement": "$1$3"
            }
          },
          "analyzer": {
            "my-analyzer": {
              "type": "custom",
              "char_filter": [
                "and", "&"
              ],
              "tokenizer": "keyword"
            }
          }
        }
      }
    }
    

    This yields the following results:

    POST test/_analyze
    {
      "analyzer": "my-analyzer",
      "text": [
        "henry&william book"
      ]
    }
    
    Results =>
    
    {
      "tokens" : [
        {
          "token" : "henrywilliam book",
          "start_offset" : 0,
          "end_offset" : 18,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    
    POST test/_analyze
    {
      "analyzer": "my-analyzer",
      "text": [
        "henry & william book"
      ]
    }
    
    Results =>
    
    {
      "tokens" : [
        {
          "token" : "henrywilliam book",
          "start_offset" : 0,
          "end_offset" : 18,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    
    POST test/_analyze
    {
      "analyzer": "my-analyzer",
      "text": [
        "henry and william book"
      ]
    }
    
    Results =>
    
    {
      "tokens" : [
        {
          "token" : "henrywilliam book",
          "start_offset" : 0,
          "end_offset" : 18,
          "type" : "word",
          "position" : 0
        }
      ]
    }
    
    POST test/_analyze
    {
      "analyzer": "my-analyzer",
      "text": [
        "henry william book"
      ]
    }
    
    Results =>
    
    {
      "tokens" : [
        {
          "token" : "henry william book",
          "start_offset" : 0,
          "end_offset" : 18,
          "type" : "word",
          "position" : 0
        }
      ]
    }
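    
    The same pattern extends to other coordinators such as or and |, as mentioned above. Here is an untested sketch that mirrors the two filters above (the index and character filter names are just placeholders); the same pair could equally be added alongside the and/& filters in a single analyzer:
    
    PUT test-or
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "or": {
              "type": "mapping",
              "mappings": [
                "or => |"
              ]
            },
            "|": {
              "type": "pattern_replace",
              "pattern": """(\w+)(\s*\|\s*)(\w+)""",
              "replacement": "$1$3"
            }
          },
          "analyzer": {
            "my-analyzer": {
              "type": "custom",
              "char_filter": [
                "or", "|"
              ],
              "tokenizer": "keyword"
            }
          }
        }
      }
    }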
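    
    Note that the keyword tokenizer above is used only to make the effect of the character filters visible in _analyze. For actual search use, one possible setup (untested; the index and field names are placeholders) is to swap in the standard tokenizer and attach the analyzer to a text field. Then "henry & william book" is indexed as ["henrywilliam", "book"], and a match query for "henry and william book" is analyzed to the same tokens at search time:
    
    PUT books
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "and": {
              "type": "mapping",
              "mappings": [
                "and => &"
              ]
            },
            "&": {
              "type": "pattern_replace",
              "pattern": """(\w+)(\s*&\s*)(\w+)""",
              "replacement": "$1$3"
            }
          },
          "analyzer": {
            "my-analyzer": {
              "type": "custom",
              "char_filter": [
                "and", "&"
              ],
              "tokenizer": "standard"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "my-analyzer"
          }
        }
      }
    }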