Tags: regex, elasticsearch, lucene, tags, elasticsearch-plugin

Using Elasticsearch to retrieve tag contents and hyphenated words


We have Elasticsearch configured with a whitespace analyzer in our application. Words are tokenized on whitespace, so a name like <fantastic> project is indexed as

["<fantastic>", "project"]

and ABC-123-def project is indexed as

["ABC-123-def", "project"]

When we then search for ABC-* the expected project turns up. But if we specifically search for <fantastic>, it won't show up at all. It's as though Lucene/Elasticsearch ignores any search term that includes angle brackets. However, we can search for fantastic, <*fantastic*, or *fantastic* and it finds the document fine, even though the word is not indexed separately from the angle brackets.
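
For concreteness, the failing search is a query_string query along these lines (a sketch; my_index is a made-up index name, and msg is the field holding the text):

    GET /my_index/_search
    {
        "query" : {
            "query_string" : {
                "default_field" : "msg",
                "query" : "<fantastic>"
            }
        }
    }

The same request with "query": "ABC-*" returns the expected project.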

The standard analyzer tokenizes on any non-alphanumeric character and also lowercases the tokens. <fantastic> project is indexed as

["fantastic", "project"]

and ABC-123-def project is indexed as

["ABC", "123", "def", "project"]

This breaks the ability to search successfully using ABC-123-*. What we gain with the standard analyzer, however, is that someone can search specifically for <fantastic> and get the desired results.
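
The _analyze API shows this splitting as well; on a recent Elasticsearch version:

    GET _analyze
    {
        "analyzer" : "standard",
        "text" : "ABC-123-def project"
    }

returns the lowercased tokens abc, 123, def, and project.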

If, instead of the standard analyzer, we add a char_filter to the whitespace analyzer that strips the angle brackets from tags (replacing <(.*)> with $1), the indexing changes: <fantastic> project is indexed as

["fantastic", "project"]

(no angle brackets). And ABC-123-def project is indexed as

["ABC-123-def", "project"]

It looks promising, but we end up with the same results as for the plain whitespace analyzer: When we search specifically for <fantastic>, we get nothing, but *fantastic* works fine.

Can anyone out on Stack Overflow explain this weirdness?


Solution

  • You could keep the whitespace tokenizer and add a word_delimiter token filter whose type_table marks the angle brackets as alphabetic characters; see the following example:

    {
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 1
            },  
            "analysis" : {
                "filter" : {
                    "custom_filter" : {
                        "type" : "word_delimiter",
                        "type_table": ["> => ALPHA", "< => ALPHA"]
                    }   
                },
                "analyzer" : {
                    "custom_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : ["lowercase", "custom_filter"]
                    }
                }
            }
        },
        "mappings" : {
            "my_type" : {
                "properties" : {
                    "msg" : {
                        "type" : "string",
                        "analyzer" : "custom_analyzer"
                    }
                }
            }
        }
    }
    

    Declaring < and > as ALPHA characters causes the underlying word_delimiter filter to treat them as alphabetic, so a token like <fantastic> is kept intact rather than split or stripped.
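
    To verify, create an index with these settings and run the custom analyzer over a sample string (test_index is a made-up name; on older versions, pass analyzer and text as query parameters instead of a JSON body):

        GET /test_index/_analyze
        {
            "analyzer" : "custom_analyzer",
            "text" : "<fantastic> project"
        }

    This should return the tokens <fantastic> and project intact, so a query_string search for <fantastic> against the msg field will now find the document.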