
How to exclude asterisks while searching with analyzer


I need to search by an array of values, where each value can be either plain text or text containing asterisks (*). For example:

["MYULTRATEXT"]

I have the following index (the real index is very large, so I have simplified it):

................
{
    "settings": {
        "analysis": {
            "char_filter": {
                "asterisk_remove": {
                    "type": "pattern_replace",
                    "pattern": "(\\d+)*(?=\\d)",
                    "replacement": "1$"
                }
            },
            "analyzer": {
                "custom_search_analyzer": {
                    "char_filter": [
                        "asterisk_remove"
                    ],
                    "type": "custom",
                    "tokenizer": "keyword"
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "name": {
                    "type": "text",
                    "analyzer": "keyword",
                    "search_analyzer": "custom_search_analyzer"
                },
     ......................

All data in the index is stored with asterisks (*), e.g.:

curl -X PUT "localhost:9200/locations/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
   "name" : "MY*ULTRA*TEXT"
}'

I need to get back exactly the same name value when I search for the string MYULTRATEXT:

curl -X POST 'localhost:9200/locations/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "terms": { "name": ["MYULTRATEXT"] } }
}'

It should return MY*ULTRA*TEXT, but it does not work, and I can't find a workaround. Any thoughts?

I tried pattern_replace, but it seems I am doing something wrong or missing something here.

So I need to replace every * with an empty string while searching.


Solution

  • There appears to be a problem with the regex you provided and the replacement pattern.

    I think what you want is:

                "char_filter": {
                  "asterisk_remove": {
                    "type": "pattern_replace",
                    "pattern": "(\\w+)\\*(?=\\w)",
                    "replacement": "$1"
                  }
                }
    

    Note the following changes:

    • \d => \w (match word characters instead of only digits)
    • * => \* (escape the asterisk, since an unescaped * is a regex quantifier; inside a JSON string the backslash is doubled, hence \\*)
    • 1$ => $1 ($<GROUPNUM> is how you reference captured groups)
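
The effect of these three changes can be checked outside Elasticsearch. The following is a minimal sketch in Python, not the char filter itself: Elasticsearch uses Java regular expressions, where a captured group is written $1, while Python's re module writes it \1. The function names are hypothetical and exist only for this illustration.

```python
import re

# Patterns as they appear in the question and in the answer. The JSON
# strings double each backslash; the regexes themselves use a single one.
ORIGINAL_PATTERN = r"(\d+)*(?=\d)"
CORRECTED_PATTERN = r"(\w+)\*(?=\w)"


def apply_original(text: str) -> str:
    # The question's pattern only looks at digits, so text without digits
    # never matches and is returned unchanged.
    return re.sub(ORIGINAL_PATTERN, "1$", text)


def strip_inner_asterisks(text: str) -> str:
    # Approximates the corrected char filter: keep the captured word
    # characters (group 1) and drop the asterisk that follows them.
    return re.sub(CORRECTED_PATTERN, r"\1", text)


print(apply_original("MY*ULTRA*TEXT"))         # MY*ULTRA*TEXT
print(strip_inner_asterisks("MY*ULTRA*TEXT"))  # MYULTRATEXT
```

Note that this pattern only removes an asterisk that sits directly between two word characters; a leading, trailing, or doubled asterisk is left in place. If every asterisk should be removed, a pattern of \* with an empty replacement may be the simpler choice.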

    To see how Elasticsearch analyzes text with a given analyzer, or to check that you defined an analyzer correctly, you can use the Analyze API (the _analyze endpoint): https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

    If you try this API with your current definition of custom_search_analyzer, you will find that "MY*ULTRA*TEXT" is analyzed to "MY*ULTRA*TEXT" and not "MYULTRATEXT" as you intend.
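
A request to the Analyze API for this check might look like the sketch below. It assumes the cluster address (localhost:9200) and index name (locations) from the curl commands in the question; the helper names build_analyze_request and analyze are hypothetical. The request body uses the "analyzer" and "text" fields of the _analyze endpoint.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="custom_search_analyzer",
                          index="locations", base_url="http://localhost:9200"):
    """Build (but do not send) a POST request to <index>/_analyze."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, **kwargs):
    """Send the request and return the list of tokens from the response."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        payload = json.load(resp)
    return [token["token"] for token in payload["tokens"]]


# Needs a running cluster with the index from the question:
# print(analyze("MY*ULTRA*TEXT"))
```

Running analyze once before and once after changing the char filter should show whether the asterisks are actually being stripped.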

    I have a personal app that I use to more easily interact with and visualize the results of the ANALYZE API. I tried your example and you can find it here: Elasticsearch Analysis Inspector.