
Elasticsearch: Unexpected interaction between synonym_graph and stop filter in custom analyzer


Description

I'm trying to query with a multi-word synonym that includes a stop word. Let's start with an example.

I have the following documents in an index:

  • foo
  • bar
  • foo bar
  • foo of bar
  • fb

The expected result for the query {"query":{"match":{"test":{"query":"foo of bar"}}}} is to return these documents:

  • foo bar
  • foo of bar
  • fb
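
To make the setup reproducible, here is a minimal sketch that builds the request bodies for the five documents and the query above. The index name `test_index` and the numeric document ids are assumptions for illustration; the question does not state them. The script only builds the payloads and does not contact a cluster.

```python
import json

# Hypothetical index name; the question does not name the index.
INDEX = "test_index"

# The five documents from the question, stored in the "test" field.
DOCS = ["foo", "bar", "foo bar", "foo of bar", "fb"]

# The match query from the question, unchanged.
QUERY = {"query": {"match": {"test": {"query": "foo of bar"}}}}


def bulk_body(index, docs):
    """Build an NDJSON body for the _bulk API.

    Each document contributes one action line and one source line.
    """
    lines = []
    for doc_id, text in enumerate(docs, start=1):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps({"test": text}))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(bulk_body(INDEX, DOCS), end="")
    print(json.dumps(QUERY))
```

The first payload would be sent to the `_bulk` endpoint with the `application/x-ndjson` content type, and the second to the index's `_search` endpoint, once the index has been created with the mapping and settings shown below.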

Configuration

In this example, I have 2 filters:

  • stop: removes the token of
  • synonym_graph: handles the synonyms fb, foo bar, foo of bar

Mapping

{
  "properties": {
    "test": {
      "type": "text",
      "analyzer": "test_index_analyzer",
      "search_analyzer": "test_search_analyzer"
    }
  }
}

Settings

{
    "settings" : {
        "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "analysis": {
                "analyzer": {
                    "test_index_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": [
                            "english_stop"
                        ]
                    },
                    "test_search_analyzer": {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": [
                            "english_stop",
                            "english_syn"
                        ]
                    }
                },
                "filter": { 
                    "english_stop": {
                        "type": "stop",
                        "stopwords": "_english_",
                        "ignore_case": true,
                        "remove_trailing": false
                    },
                    "english_syn": {
                        "type": "synonym_graph",
                        "synonyms": [
                            "fb,foo of bar",
                            "fb,foo bar"
                        ]
                    }
                }
            }
        }
    }
}

Result

Token format: token,start_offset-end_offset,type / position / positionLength

Query: fb
Search result: fb
Index analysis:
  • fb,0-2,word / 0 / 1
Search analysis:
  • foo,0-2,SYNONYM / 0 / 1
  • foo,0-2,SYNONYM / 0 / 3
  • fb,0-2,word / 0 / 4
  • bar,0-2,SYNONYM / 2 / 2
  • bar,0-2,SYNONYM / 3 / 1

Query: foo of bar
Search result: fb
Index analysis:
  • foo,0-3,word / 0 / 1
  • bar,7-10,word / 2 / 1
Search analysis:
  • fb,0-10,SYNONYM / 0 / 3
  • foo,0-3,word / 0 / 1
  • bar,7-10,word / 2 / 1

Query: foo bar
Search result: fb, foo bar
Index analysis:
  • foo,0-3,word / 0 / 1
  • bar,4-7,word / 1 / 1
Search analysis:
  • fb,0-7,SYNONYM / 0 / 2
  • foo,0-3,word / 0 / 1
  • bar,4-7,word / 1 / 1

All three searches are expected to return these 3 documents:

  • fb
  • foo bar
  • foo of bar

Note: foo of bar is never returned

My guess is that foo of bar got indexed with positions [foo, , bar] by the stop filter, while the synonym filter is looking for [foo, bar].

Do you have any advice on how to reach my goal?
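
The position gap guessed at above can be illustrated without a cluster. The following is a minimal sketch, not Elasticsearch's implementation: `analyze` is a hypothetical stand-in for a whitespace tokenizer followed by a stop filter, and its stop list contains only "of" rather than the full `_english_` set.

```python
# Assumption: a one-word stop list is enough to illustrate the example.
STOPWORDS = {"of"}


def analyze(text, stopwords=STOPWORDS):
    """Return (token, position) pairs for whitespace-split text.

    A removed stop word still consumes a position, which leaves a gap
    between the tokens on either side of it.
    """
    tokens = []
    for position, word in enumerate(text.split()):
        if word.lower() in stopwords:
            # The token is dropped, but the position counter has advanced.
            continue
        tokens.append((word, position))
    return tokens


def positions(tokens):
    """Extract only the positions from (token, position) pairs."""
    return [position for _, position in tokens]


if __name__ == "__main__":
    for text in ("foo bar", "foo of bar"):
        print(text, "->", analyze(text))
```

In this sketch both inputs yield the same two tokens, but "foo bar" places them at adjacent positions while "foo of bar" leaves position 1 empty, which matches the index analysis shown in the results above.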


Solution

  • When you use a stop filter, the positions of the remaining tokens are preserved, so if you check the analyzer output for foo of bar you get the following result:

    {
      "tokens" : [
        {
          "token" : "foo",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0
        },
        {
          "token" : "bar",
          "start_offset" : 7,
          "end_offset" : 10,
          "type" : "word",
          "position" : 2
        }
      ]
    }
    
    

    As you can see, the 'foo' token is at position 0 and the 'bar' token is at position 2, so your synonym filter can't find this document.

    To solve your problem, apply the synonym filter first and then remove stop words, as shown below.

    "test_search_analyzer": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": [
            "english_syn",
            "english_stop"
        ]
    }
    

    You should also add 'foo bar, foo of bar' to your synonym list.

    In my opinion, keeping stop words is necessary because it can help produce more precise search results (especially with the BM25 similarity that ES uses). Elastic has published an official article on this topic.
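
Putting the two changes from this answer together, the analysis section might look like the sketch below. It reuses the analyzer and filter names from the question; adding 'foo bar,foo of bar' as a third rule next to the two original ones is one literal reading of the suggestion, and the result has not been verified against a running cluster.

```python
import json


def fixed_analysis():
    """Return the question's analysis settings with the answer's changes applied."""
    return {
        "analyzer": {
            "test_index_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": ["english_stop"],
            },
            "test_search_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                # Change 1: synonyms are expanded before stop words are removed.
                "filter": ["english_syn", "english_stop"],
            },
        },
        "filter": {
            "english_stop": {
                "type": "stop",
                "stopwords": "_english_",
                "ignore_case": True,
                "remove_trailing": False,
            },
            "english_syn": {
                "type": "synonym_graph",
                "synonyms": [
                    "fb,foo of bar",
                    "fb,foo bar",
                    # Change 2: the extra rule suggested in the answer.
                    "foo bar,foo of bar",
                ],
            },
        },
    }


if __name__ == "__main__":
    # Print a body in the same shape as the question's Settings section.
    print(json.dumps({"settings": {"index": {"analysis": fixed_analysis()}}}, indent=2))
```

Because the change affects only the search analyzer and its synonym rules, the index analyzer is left exactly as in the question.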