
Elasticsearch - at&t and procter&gamble cases


By default, Elasticsearch's english analyzer breaks at&t into the tokens at and t, and then removes at as a stop word.

POST _analyze
{
  "analyzer": "english", 
  "text": "A word AT&T Procter&Gamble"
}

As a result, the tokens look like this:

{
  "tokens" : [
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "gambl",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

I want to be able to match at&t exactly, and at the same time to be able to search for procter&gamble exactly as well as for just procter.

So I want to build an analyzer that produces both the tokens at&t and t for the string at&t, and procter, gambl, and procter&gamble for procter&gamble.

Is there a way to create such an analyzer? Or should I create two index fields - one with the regular English analyzer and the other with an English analyzer that skips tokenization on &?


Solution

  • Settings: you can tokenize on whitespace and use a word_delimiter_graph filter to create the extra tokens for at&t:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "whitespace_with_acronymns": {
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "acronymns"
              ]
            }
          },
          "filter": {
            "acronymns": {
              "type": "word_delimiter_graph",
              "catenate_all": true
            }
          }
        }
      }
    }
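
    To actually use the analyzer, wrap these settings in an index creation call and point a field mapping at it. A minimal sketch - the index name my-index and the field name company are placeholders, not from the original question:

    PUT my-index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "whitespace_with_acronymns": {
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "acronymns"
              ]
            }
          },
          "filter": {
            "acronymns": {
              "type": "word_delimiter_graph",
              "catenate_all": true
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "company": {
            "type": "text",
            "analyzer": "whitespace_with_acronymns"
          }
        }
      }
    }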
    

    Tokens: since the custom analyzer lives in the index settings, test it through that index's _analyze endpoint (my-index from the sketch above):

    POST my-index/_analyze
    {
      "analyzer": "whitespace_with_acronymns", 
      "text": "A word AT&T Procter&Gamble"
    }
    

    Result: at&t is tokenized as at, t, and att, so you can match it by searching for at, t, or at&t; likewise procter&gamble yields procter, gamble, and proctergamble (a search sketch follows the token list below).

    {
      "tokens" : [
        {
          "token" : "a",
          "start_offset" : 0,
          "end_offset" : 1,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "word",
          "start_offset" : 2,
          "end_offset" : 6,
          "type" : "word",
          "position" : 1
        },
        {
          "token" : "at",
          "start_offset" : 7,
          "end_offset" : 9,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "att",
          "start_offset" : 7,
          "end_offset" : 11,
          "type" : "word",
          "position" : 2
        },
        {
          "token" : "t",
          "start_offset" : 10,
          "end_offset" : 11,
          "type" : "word",
          "position" : 3
        },
        {
          "token" : "procter",
          "start_offset" : 12,
          "end_offset" : 19,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "proctergamble",
          "start_offset" : 12,
          "end_offset" : 26,
          "type" : "word",
          "position" : 4
        },
        {
          "token" : "gamble",
          "start_offset" : 20,
          "end_offset" : 26,
          "type" : "word",
          "position" : 5
        }
      ]
    }
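
    To verify end to end, index a document and query it. A sketch assuming the hypothetical my-index/company setup from above; a match query analyzes its input with the same analyzer, so at&t itself becomes att and matches the stored token:

    PUT my-index/_doc/1
    {
      "company": "A word AT&T Procter&Gamble"
    }

    GET my-index/_search
    {
      "query": {
        "match": {
          "company": "at&t"
        }
      }
    }

    The same query with procter&gamble or just procter should return the document as well.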
    

    If you want to remove the stop word at, you can add a stop token filter, here together with the usual English stemming filters:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "whitespace_with_acronymns": {
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "acronymns",
                "english_possessive_stemmer",
                "english_stop",
                "english_keywords",
                "english_stemmer"
              ]
            }
          },
          "filter": {
            "acronymns": {
              "type": "word_delimiter_graph",
              "catenate_all": true
            },
            "english_stop": {
              "type":       "stop",
              "stopwords":  "_english_" 
            },
            "english_keywords": {
              "type":       "keyword_marker",
              "keywords":   ["example"] 
            },
            "english_stemmer": {
              "type":       "stemmer",
              "language":   "english"
            },
            "english_possessive_stemmer": {
              "type":       "stemmer",
              "language":   "possessive_english"
            }
          }
        }
      }
    }
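
    Re-running _analyze against an index created with these settings should now drop a and at. Note that the catenated token att still survives, because the acronymns filter runs before english_stop, while gamble and proctergamble are further stemmed (to gambl and proctergambl):

    POST my-index/_analyze
    {
      "analyzer": "whitespace_with_acronymns",
      "text": "A word AT&T Procter&Gamble"
    }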