I am trying to produce n-gram features with an Elasticsearch analyzer. In particular, I would like the text to be padded with a leading and a trailing space before the n-grams are generated. For example, if the text is "2 Quick Foxes", the trigrams with leading/trailing space would be:
" 2 ", "2 Q", ..., "Fox", "oxe", "xes", "es "
Here is my current setup:
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes"
}
You could add two pattern_replace character filters -- one that prepends a space to the text, the other that appends one:
PUT my-index-000001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "my_tokenizer",
            "char_filter": [
              "leading_space",
              "trailing_space"
            ]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": [
              "letter",
              "digit",
              "whitespace"
            ]
          }
        },
        "char_filter": {
          "leading_space": {
            "type": "pattern_replace",
            "pattern": "(^.)",
            "replacement": " $1"
          },
          "trailing_space": {
            "type": "pattern_replace",
            "pattern": "(.$)",
            "replacement": "$1 "
          }
        }
      }
    }
  }
}
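Notice the added whitespace entry in the token_chars of my_tokenizer -- the above won't work without it, because the tokenizer splits the text on any character that is not listed in token_chars, so spaces would never end up inside a token.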
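If you want to sanity-check what _analyze should return, here is a rough Python sketch that imitates the two character filters and the trigram tokenizer. It is only an approximation for illustration, not how Elasticsearch implements it: the helper names pad and ngrams are made up, and str.isalnum / str.isspace stand in for the letter, digit and whitespace character classes.

```python
import re


def pad(text: str) -> str:
    """Imitate the two pattern_replace char filters: one space at each end."""
    text = re.sub(r"(^.)", r" \1", text)   # leading_space
    return re.sub(r"(.$)", r"\1 ", text)   # trailing_space


def ngrams(text: str, n: int = 3, keep_whitespace: bool = True) -> list[str]:
    """Approximate an ngram tokenizer with min_gram == max_gram == n.

    Characters that are not "token chars" act as split points, and runs
    shorter than n produce no tokens at all.
    """
    def is_token_char(c: str) -> bool:
        return c.isalnum() or (keep_whitespace and c.isspace())

    # Collect maximal runs of token characters.
    runs, current = [], ""
    for c in text:
        if is_token_char(c):
            current += c
        else:
            if current:
                runs.append(current)
            current = ""
    if current:
        runs.append(current)

    # Slide a window of width n over each run.
    return [run[i:i + n] for run in runs for i in range(len(run) - n + 1)]


# With padding and whitespace kept as a token char (the config above):
print(ngrams(pad("2 Quick Foxes")))

# Without whitespace in token_chars (the config from the question):
print(ngrams("2 Quick Foxes", keep_whitespace=False))
```

The first call yields the padded trigrams starting with " 2 " and ending with "es ". The second one yields only the trigrams inside "Quick" and "Foxes", which shows why the whitespace entry is needed.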