Tags: regex, elasticsearch, lucene, tags, elasticsearch-plugin

Using Elasticsearch to retrieve tag contents and hyphenated words


We have Elasticsearch configured with a whitespace analyzer in our application. Words are tokenized on whitespace, so a name like <fantastic> project is indexed as

["<fantastic>", "project"]

and ABC-123-def project is indexed as

["ABC-123-def", "project"]

When we then search for ABC-* the expected project turns up. But if we specifically search for <fantastic>, it won't show up at all. It's as though Lucene/Elasticsearch ignores any search term that includes angle brackets. However, we can search for fantastic, <*fantastic*, or *fantastic* and it finds the document fine, even though the word is not indexed separately from the angle brackets.
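
For concreteness, the failing search is a query_string query along these lines (a sketch; my_index is a made-up index name, and msg is the field holding the text):

    GET /my_index/_search
    {
        "query" : {
            "query_string" : {
                "default_field" : "msg",
                "query" : "<fantastic>"
            }
        }
    }

The same request with "query": "ABC-*" returns the expected project.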

The standard analyzer tokenizes on any non-alphanumeric character and also lowercases the tokens. <fantastic> project is indexed as

["fantastic", "project"]

and ABC-123-def project is indexed as

["ABC", "123", "def", "project"]

This breaks the ability to search successfully using ABC-123-*. What we gain with the standard analyzer, however, is that someone can search specifically for <fantastic> and get the desired results.
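
The _analyze API shows this splitting as well; on a recent Elasticsearch version:

    GET _analyze
    {
        "analyzer" : "standard",
        "text" : "ABC-123-def project"
    }

returns the lowercased tokens abc, 123, def, and project.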

If, instead of the standard analyzer, we add a char_filter to the whitespace analyzer that strips the angle brackets from tags (replacing <(.*)> with $1), the indexing changes: <fantastic> project is indexed as

["fantastic", "project"]

(no angle brackets). And ABC-123-def project is indexed as

["ABC-123-def", "project"]

It looks promising, but we end up with the same results as for the plain whitespace analyzer: When we search specifically for <fantastic>, we get nothing, but *fantastic* works fine.

Can anyone out on Stack Overflow explain this weirdness?


Solution

  • You could keep the whitespace tokenizer and add a word_delimiter token filter whose type_table marks the angle brackets as alphabetic characters; see the following example:

    {
        "settings" : {
            "index" : {
                "number_of_shards" : 1,
                "number_of_replicas" : 1
            },  
            "analysis" : {
                "filter" : {
                    "custom_filter" : {
                        "type" : "word_delimiter",
                        "type_table": ["> => ALPHA", "< => ALPHA"]
                    }   
                },
                "analyzer" : {
                    "custom_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : ["lowercase", "custom_filter"]
                    }
                }
            }
        },
        "mappings" : {
            "my_type" : {
                "properties" : {
                    "msg" : {
                        "type" : "string",
                        "analyzer" : "custom_analyzer"
                    }
                }
            }
        }
    }
    

    Declaring < and > as ALPHA characters causes the underlying word_delimiter filter to treat them as alphabetic, so a token like <fantastic> is kept intact rather than split or stripped.
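
    To verify, create an index with these settings and run the custom analyzer over a sample string (test_index is a made-up name; on older versions, pass analyzer and text as query parameters instead of a JSON body):

        GET /test_index/_analyze
        {
            "analyzer" : "custom_analyzer",
            "text" : "<fantastic> project"
        }

    This should return the tokens <fantastic> and project intact, so a query_string search for <fantastic> against the msg field will now find the document.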