By default, Elasticsearch with the english analyzer breaks at&t into the tokens at and t, and then removes at as a stopword.
POST _analyze
{
"analyzer": "english",
"text": "A word AT&T Procter&Gamble"
}
As a result, the tokens look like this:
{
"tokens" : [
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "gambl",
"start_offset" : 20,
"end_offset" : 26,
"type" : "<ALPHANUM>",
"position" : 5
}
]
}
I want to be able to match at&t exactly, and at the same time be able to search for procter&gamble exactly as well as for just procter.

So I want to build an analyzer that creates both tokens at&t and t for the string at&t, and procter, gambl, procter&gamble for procter&gamble.

Is there a way to create such an analyzer? Or should I create 2 index fields - one with the regular English analyzer and the other one with English analysis except for tokenization by &?
Mappings: you can tokenize on whitespace and use a word_delimiter_graph token filter to create the extra tokens for at&t. Create an index with this analysis configuration (my_index is an example name):
PUT my_index
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
}
}
}
}
}
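As a side note, you can test a filter chain like this without creating an index at all, by defining it inline in the _analyze request (standard _analyze API; the inline filter mirrors the acronymns definition above):
POST _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "lowercase",
    { "type": "word_delimiter_graph", "catenate_all": true }
  ],
  "text": "A word AT&T Procter&Gamble"
}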
Tokens: since the custom analyzer lives in the index settings, the _analyze call is scoped to that index:
POST my_index/_analyze
{
  "analyzer": "whitespace_with_acronymns",
  "text": "A word AT&T Procter&Gamble"
}
Result: at&t is tokenized as at, t, and the catenated form att, so you can find it by searching for at, t, or at&t.
{
"tokens" : [
{
"token" : "a",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "word",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "at",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 2
},
{
"token" : "att",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "t",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "procter",
"start_offset" : 12,
"end_offset" : 19,
"type" : "word",
"position" : 4
},
{
"token" : "proctergamble",
"start_offset" : 12,
"end_offset" : 26,
"type" : "word",
"position" : 4
},
{
"token" : "gamble",
"start_offset" : 20,
"end_offset" : 26,
"type" : "word",
"position" : 5
}
]
}
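This also works at query time: a match query analyzes its input with the field's analyzer by default, so a query for AT&T is itself expanded to at, att, t and matches the document. A minimal sketch, assuming the analyzer has been applied to a text field named text in an index my_index (both names are examples, see the mapping at the end):
GET my_index/_search
{
  "query": {
    "match": { "text": "AT&T" }
  }
}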
If you want to remove the stopword at, you can extend the chain with the english analyzer's usual filters (possessive stemmer, stop filter, keyword marker, stemmer); the catenated token att is not a stopword, so it survives:
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_with_acronymns": {
"tokenizer": "whitespace",
"filter": [
"lowercase",
"acronymns",
"english_possessive_stemmer",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
},
"filter": {
"acronymns": {
"type": "word_delimiter_graph",
"catenate_all": true
},
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
}
}
}
}
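Finally, on the two-fields part of the question: you don't need a second top-level field, because a multi-field gives you both behaviors on one source value. A sketch using the shorter analyzer from above (my_index and the field names are examples, not from the original):
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "acronymns"]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "whitespace_with_acronymns",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
  }
}
Query text for acronym-aware matching (at&t, procter&gamble) and text.english for regular English-analyzed search (procter, gambl).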