
Elasticsearch - at&t and procter&gamble cases


By default, Elasticsearch's english analyzer breaks at&t into the tokens at and t, and then removes at as a stop word.

POST _analyze
{
  "analyzer": "english", 
  "text": "A word AT&T Procter&Gamble"
}

As a result, the tokens look like this:

{
  "tokens" : [
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "gambl",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

I want to be able to match at&t exactly, and at the same time to be able to search for procter&gamble exactly as well as for just procter.

So I want to build an analyzer that produces both the tokens at&t and t for the string at&t, and procter, gambl, and procter&gamble for procter&gamble.

Is there a way to create such an analyzer? Or should I create two index fields - one with the regular English analyzer and the other with an English analyzer that skips tokenization on &?


Solution

  • Settings: you can tokenize on whitespace and use a word_delimiter_graph filter to create the extra tokens for at&t:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "whitespace_with_acronymns": {
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "acronymns"
              ]
            }
          },
          "filter": {
            "acronymns": {
              "type": "word_delimiter_graph",
              "catenate_all": true
            }
          }
        }
      }
    }
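
    To actually use the analyzer, wrap these settings in an index creation call and point a field mapping at it. A minimal sketch - the index name my-index and the field name company are placeholders, not from the original question:

    PUT my-index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "whitespace_with_acronymns": {
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "acronymns"
              ]
            }
          },
          "filter": {
            "acronymns": {
              "type": "word_delimiter_graph",
              "catenate_all": true
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "company": {
            "type": "text",
            "analyzer": "whitespace_with_acronymns"
          }
        }
      }
    }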
    

    Tokens: since the custom analyzer lives in the index settings, test it through that index's _analyze endpoint (my-index from the sketch above):

    POST my-index/_analyze
    {
      "analyzer": "whitespace_with_acronymns", 
      "text": "A word AT&T Procter&Gamble"
    }
    

    Result: at&t is tokenized as at, t, and att, so you can match it by searching for at, t, or at&t; likewise procter&gamble yields procter, gamble, and proctergamble (a search sketch follows the token list below).

    {
      "tokens" : [
        {
          "token" : "a",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "word",
          "start_offset" : 2,
          "end_offset" : 6,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "at",
          "start_offset" : 7,
          "end_offset" : 9,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "att",
          "start_offset" : 7,
          "end_offset" : 11,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "t",
          "start_offset" : 10,
          "end_offset" : 11,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "procter",
          "start_offset" : 12,
          "end_offset" : 19,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "proctergamble",
          "start_offset" : 12,
          "end_offset" : 26,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "gamble",
          "start_offset" : 20,
          "end_offset" : 26,
          "type" : "word",
          "position" : 5
        }
      ]
    }
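
    To verify end to end, index a document and query it. A sketch assuming the hypothetical my-index/company setup from above; a match query analyzes its input with the same analyzer, so at&t itself becomes att and matches the stored token:

    PUT my-index/_doc/1
    {
      "company": "A word AT&T Procter&Gamble"
    }

    GET my-index/_search
    {
      "query": {
        "match": {
          "company": "at&t"
        }
      }
    }

    The same query with procter&gamble or just procter should return the document as well.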
    

    If you want to remove the stop word at, you can add a stop token filter, here together with the usual English stemming filters:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "whitespace_with_acronymns": {
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "acronymns",
                "english_possessive_stemmer",
                "english_stop",
                "english_keywords",
                "english_stemmer"
              ]
            }
          },
          "filter": {
            "acronymns": {
              "type": "word_delimiter_graph",
              "catenate_all": true
            },
            "english_stop": {
              "type":       "stop",
              "stopwords":  "_english_" 
            },
            "english_keywords": {
              "type":       "keyword_marker",
              "keywords":   ["example"] 
            },
            "english_stemmer": {
              "type":       "stemmer",
              "language":   "english"
            },
            "english_possessive_stemmer": {
              "type":       "stemmer",
              "language":   "possessive_english"
            }
          }
        }
      }
    }
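
    Re-running _analyze against an index created with these settings should now drop a and at. Note that the catenated token att still survives, because the acronymns filter runs before english_stop, while gamble and proctergamble are further stemmed (to gambl and proctergambl):

    POST my-index/_analyze
    {
      "analyzer": "whitespace_with_acronymns",
      "text": "A word AT&T Procter&Gamble"
    }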