Elasticsearch synonym filter after stemmer sometimes does not work properly


With a simple analyzer that applies a synonym filter after a stemmer, the synonym rule sometimes does not fire for a stemmed word, even though that exact stemmed form is the one listed in the synonym filter.

First, I created an analyzer that applies a synonym filter after the French snowball filter.

curl -XPUT "http://localhost:9200/my_index" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "my_snow": {
          "type": "snowball",
          "language": "French"
        },
        "my_synonym_filter": {
          "type": "synonym", 
          "synonyms": [ 
            "autr => synonym_1",
            "journali => synonym_2",
            "journalier => synonym_3"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_snow",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}'

Because my synonym filter comes after the stemmer, I had to find out what each word is stemmed into. To do that, I ran the /my_index/_analyze query with "explain": "true" on the analyzer without the synonym filter, and put the stemmed tokens it returned into the synonym filter.
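
The stem lookup described above can be sketched as a small shell helper. It assumes that the index and its "my_snow" filter from the settings above already exist and that Elasticsearch listens on http://localhost:9200. It names the tokenizer and the filter directly in the _analyze request, so no synonym filter runs. The function names and the ES_URL variable are made up for this example.

```shell
# Assumed host; override ES_URL to point at another cluster.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the _analyze request body: standard tokenizer plus the index's
# "my_snow" stemmer only, so the output shows the stems before any
# synonym rule is applied.
# $1 = text to analyze (plain words only; no JSON escaping is done here)
stem_request_body() {
  printf '{"tokenizer":"standard","filter":["my_snow"],"text":"%s","explain":true}' "$1"
}

# Send the request to the index-level _analyze endpoint.
show_stems() {
  curl -s -XGET "$ES_URL/my_index/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(stem_request_body "$1")"
}
```

With a running node, show_stems "journalière" | json_pp prints the tokens produced by the tokenizer and by the stemmer.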

Then I tested this analyzer with the text "journalière". It is stemmed to "journali", as shown below, but the synonym filter transforms it to "synonym_3" instead of "synonym_2"! Without the line "journalier => synonym_3" in the filter, it would not be transformed at all. Here are the query and the response:

curl -XGET "http://localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer" : "my_analyzer",
  "text" : "journalière",
  "explain" : "true"
}' | json_pp
{
   "detail" : {
      "charfilters" : [],
      "custom_analyzer" : true,
      "tokenfilters" : [
         {
            "name" : "my_snow",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journali",
                  "type" : "<ALPHANUM>"
               }
            ]
         },
         {
            "name" : "my_synonym_filter",
            "tokens" : [
               {
                  "bytes" : "[73 79 6e 6f 6e 79 6d 5f 33]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "synonym_3",
                  "type" : "SYNONYM"
               }
            ]
         }
      ],
      "tokenizer" : {
         "name" : "standard",
         "tokens" : [
            {
               "bytes" : "[6a 6f 75 72 6e 61 6c 69 c3 a8 72 65]",
               "end_offset" : 11,
               "position" : 0,
               "positionLength" : 1,
               "start_offset" : 0,
               "termFrequency" : 1,
               "token" : "journalière",
               "type" : "<ALPHANUM>"
            }
         ]
      }
   }
}

I also tested the analyzer with the word "journaliere" to see whether the accent had something to do with the bug. It is stemmed to "journalier", and then the synonym filter leaves it unchanged. See the query and response below:

curl -XGET "http://localhost:9200/my_index/_analyze" -H 'Content-Type: application/json' -d '
{
  "analyzer" : "my_analyzer",
  "text" : "journaliere",
  "explain" : "true"
}' | json_pp
{
   "detail" : {
      "charfilters" : [],
      "custom_analyzer" : true,
      "tokenfilters" : [
         {
            "name" : "my_snow",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journalier",
                  "type" : "<ALPHANUM>"
               }
            ]
         },
         {
            "name" : "my_synonym_filter",
            "tokens" : [
               {
                  "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72]",
                  "end_offset" : 11,
                  "keyword" : false,
                  "position" : 0,
                  "positionLength" : 1,
                  "start_offset" : 0,
                  "termFrequency" : 1,
                  "token" : "journalier",
                  "type" : "<ALPHANUM>"
               }
            ]
         }
      ],
      "tokenizer" : {
         "name" : "standard",
         "tokens" : [
            {
               "bytes" : "[6a 6f 75 72 6e 61 6c 69 65 72 65]",
               "end_offset" : 11,
               "position" : 0,
               "positionLength" : 1,
               "start_offset" : 0,
               "termFrequency" : 1,
               "token" : "journaliere",
               "type" : "<ALPHANUM>"
            }
         ]
      }
   }
}

Finally, to be sure that other words worked, I tested with "autre". It is stemmed to "autr", which then gives "synonym_1", as expected.

I'm using Elasticsearch 7.17.9. Here is my docker-compose config:

version: '3.7'
services:
    elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:7.17.9
        container_name: elasticsearch
        environment:
            - discovery.type=single-node
            - bootstrap.memory_lock=true
            - "ES_JAVA_OPTS=-Xms1000m -Xmx2000m"
        ulimits:
            memlock:
                soft: -1
                hard: -1
        volumes:
            - elasticsearch-data:/usr/share/elasticsearch/data
        ports:
            - 9200:9200
volumes:
    elasticsearch-data:
        driver: local

It seems that the tokens output by the explained analysis are not always the same as the words matched by the synonym filter. Is there a way to find out which synonym rule "journaliere" matches after stemming, or is there a bug somewhere?


Solution

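  • I opened an issue, and in the end it was not a bug. When the synonym filter comes after the stemmer, the synonym rules should contain the original, unstemmed words, not the stemmed tokens. Elasticsearch parses the synonym rules with the token filters that precede the synonym filter in the chain, so the rules get stemmed too. That would explain why the rule "journalier => synonym_3" ended up matching the token "journali".

    Here is how I should've defined my synonym filter: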

    "my_synonym_filter": {
      "type": "synonym_graph",
      "synonyms": [
        "autre => synonym_1",
        "journalière => synonym_2",
        "journaliere => synonym_3"
      ]
    }
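
For completeness, here is a sketch of the whole index definition with the corrected filter, written as shell helpers. The index name my_index_fixed, the ES_URL and INDEX variables, and the function names are assumptions for this example. The filter is the one shown above, and the rest of the settings is copied from the question.

```shell
# Assumed host and a hypothetical index name; override as needed.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX="${INDEX:-my_index_fixed}"

# Index settings: same analyzer chain as in the question, but the
# synonym rules list the original (unstemmed) words.
fixed_settings() {
  cat <<'JSON'
{
  "settings": {
    "analysis": {
      "filter": {
        "my_snow": { "type": "snowball", "language": "French" },
        "my_synonym_filter": {
          "type": "synonym_graph",
          "synonyms": [
            "autre => synonym_1",
            "journalière => synonym_2",
            "journaliere => synonym_3"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["my_snow", "my_synonym_filter"]
        }
      }
    }
  }
}
JSON
}

# Create the index with the settings above.
create_fixed_index() {
  curl -s -XPUT "$ES_URL/$INDEX" \
    -H 'Content-Type: application/json' \
    -d "$(fixed_settings)"
}

# Build the _analyze body for the full analyzer.
# $1 = text to analyze (plain words only; no JSON escaping is done here)
analyze_request_body() {
  printf '{"analyzer":"my_analyzer","text":"%s","explain":true}' "$1"
}

# Run one word through the full analyzer, stemmer and synonyms included.
check_word() {
  curl -s -XGET "$ES_URL/$INDEX/_analyze" \
    -H 'Content-Type: application/json' \
    -d "$(analyze_request_body "$1")"
}
```

Running create_fixed_index once and then check_word "journalière" | json_pp shows which synonym each input word ends up with.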