
Elasticsearch - Stop analyzer doesn't allow numbers


I'm trying to build a search utility using Elasticsearch 6.3.0 where any term can be searched within the database. I have applied the Stop Analyzer to exclude some generic words. However, after adding that analyzer, the system stopped matching the numeric part of terms that contain numbers.

For example, if I search for news24, it removes 24 and searches only for the term "news" in all records. I'm unsure why.

Below is the query I am using:

{
   "from": 0,
   "size": 10,
   "explain": false,
   "stored_fields": [
      "_source"
   ],
   "query": {
      "function_score": {
         "query": {
            "multi_match": {
               "query": "news24",
               "analyzer": "stop",
               "fields": [
                  "title",
                  "keywords",
                  "url"
               ]
            }
         },
         "functions": [
            {
               "script_score": {
                  "script": "( (doc['isSponsered'].value == 'y') ? 100 : 0 )"
               }
            },
            {
               "script_score": {
                  "script": "doc['linksCount'].value"
               }
            }
         ],
         "score_mode": "sum",
         "boost_mode": "sum"
      }
   },
   "script_fields": {
      "custom_score": {
         "script": {
            "lang": "painless",
            "source": "params._source.linksArray"
         }
      }
   },
   "highlight": {
      "pre_tags": [
         ""
      ],
      "post_tags": [
         "<\/span>"
      ],
      "fields": {
         "title": {
            "type": "plain"
         },
         "keywords": {
            "type": "plain"
         },
         "description": {
            "type": "plain"
         },
         "url": {
            "type": "plain"
         }
      }
   }
}

Solution

  • That is because the stop analyzer is just an extension of the Simple Analyzer that adds stop word removal. It uses the Lowercase Tokenizer, which breaks the text into tokens whenever it encounters a character that is not a letter (and also lowercases all the terms).

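    So basically, if you have something like news24, it produces only the token news: the tokenizer breaks at the 2, and because digits are not letters they are dropped rather than emitted as a token of their own.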
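
    A quick way to confirm this is to run the built-in stop analyzer through the _analyze API. This is only a sketch; it names no index because stop is a built-in analyzer, so the request goes to the cluster root:

    POST _analyze
    {
      "analyzer": "stop",
      "text": "news24"
    }

    This should return a single token, news, with the digits dropped.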

    This is the default behaviour of the stop analyzer. If you want to use stop words and still keep numerics in the picture, you need to create a custom analyzer as shown below:

    Index settings:

    PUT sometestindex
    {  
       "settings":{  
          "analysis":{  
             "analyzer":{  
                "my_english_analyzer":{  
                   "type":"standard",
                   "stopwords":"_english_"
                }
             }
          }
       }
    }
    

    This uses the Standard Analyzer, which internally uses the Standard Tokenizer (so alphanumeric terms stay intact), and the stopwords setting makes it remove English stop words as well.

    Analysis Query To Test

    POST sometestindex/_analyze
    {
      "analyzer": "my_english_analyzer",
      "text":     "the name of the channel is news24"
    }
    

    Query Result

    {
      "tokens": [
        {
          "token": "name",
          "start_offset": 4,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "channel",
          "start_offset": 16,
          "end_offset": 23,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "news24",
          "start_offset": 27,
          "end_offset": 33,
          "type": "<ALPHANUM>",
          "position": 6
        }
      ]
    }
    

    You can see in the tokens above that news24 is preserved as a single token.
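
    To apply this to the original search, one option is to point the multi_match query at the custom analyzer instead of stop. The request below is only a sketch: it assumes the documents live in sometestindex (the index name used in the example above) and that the title, keywords and url fields were indexed with an analyzer that keeps news24 intact, such as the default standard analyzer.

    GET sometestindex/_search
    {
      "query": {
        "multi_match": {
          "query": "news24",
          "analyzer": "my_english_analyzer",
          "fields": ["title", "keywords", "url"]
        }
      }
    }

    The rest of the original request (function_score, script_fields, highlight) can stay as it is; only the analyzer name changes.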

    Hope it helps!