We have Elasticsearch configured with a whitespace analyzer in our application. Words are tokenized on whitespace, so a name like `<fantastic> project` is indexed as `["<fantastic>", "project"]`, and `ABC-123-def project` is indexed as `["ABC-123-def", "project"]`.
When we then search for `ABC-*`, the expected project turns up. But if we specifically search for `<fantastic>`, it won't show up at all. It's as though Lucene/Elasticsearch ignores any search term that includes angle brackets. However, we can search for `fantastic`, or `<*fantastic*`, or `*fantastic*`, and it finds the document fine, even though the word is not indexed separately from the angle brackets.
The standard analyzer tokenizes on any non-alphanumeric character, so `<fantastic> project` is indexed as `["fantastic", "project"]` and `ABC-123-def project` is indexed as `["ABC", "123", "def", "project"]`.
This breaks the ability to search successfully using `ABC-123-*`. However, what we gain with the standard analyzer is that someone can then search specifically for `<fantastic>` and it returns the desired results.
If, instead of the standard analyzer, we add a char_filter to the whitespace analyzer that strips the angle brackets from tags (replacing `<(.*)>` with `$1`), the indexing looks like this: `<fantastic> project` is indexed as `["fantastic", "project"]` (no angle brackets), and `ABC-123-def project` is indexed as `["ABC-123-def", "project"]`.
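For reference, that char_filter variant could be expressed roughly like this (a sketch; the filter and analyzer names are placeholders, and `pattern_replace` is the built-in char filter type that performs regex replacement):

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_tag_brackets": {
          "type": "pattern_replace",
          "pattern": "<(.*)>",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "whitespace_no_brackets": {
          "type": "custom",
          "tokenizer": "whitespace",
          "char_filter": ["strip_tag_brackets"]
        }
      }
    }
  }
}
```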
That looks promising, but we end up with the same results as with the plain whitespace analyzer: when we search specifically for `<fantastic>`, we get nothing, while `*fantastic*` works fine.

Can anyone on Stack Overflow explain this weirdness?
You could configure a word_delimiter token filter that treats the special characters as alphabetic; see the following example:
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "custom_filter" : {
          "type" : "word_delimiter",
          "type_table" : ["> => ALPHA", "< => ALPHA"]
        }
      },
      "analyzer" : {
        "custom_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "custom_filter"]
        }
      }
    }
  },
  "mappings" : {
    "my_type" : {
      "properties" : {
        "msg" : {
          "type" : "string",
          "analyzer" : "custom_analyzer"
        }
      }
    }
  }
}
Mapping `<` and `>` as ALPHA characters causes the underlying word_delimiter to treat them as alphabetic characters.
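You can verify the behavior by running the custom analyzer against the sample text via the `_analyze` API (the index name `my_index` is a placeholder, and the exact request syntax varies between Elasticsearch versions):

```json
POST /my_index/_analyze
{
  "analyzer": "custom_analyzer",
  "text": "<fantastic> project"
}
```

With `<` and `>` treated as ALPHA, `<fantastic>` should come back as a single (lowercased) token instead of being split or stripped by the word_delimiter filter.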