Elasticsearch Spanish stemming not working with "rojo" color


I am fairly new to Elasticsearch. I am trying to analyze input in Spanish, but there seems to be an issue with the color "rojo" (red in Spanish).

According to the stemmer demo, the string "Polera color rojo" (red colored shirt) should be stemmed to "poler color roj" and "Polera roja" (red shirt) should be stemmed to "poler roj", so that a search for either "rojo" or "roja" returns both results.

[Image: first example using the stemmer demo]

[Image: second example using the stemmer demo]

I initialized the index with the following code in Kibana's console:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": "_spanish_"
        },
        "spanish_stemmer": {
          "type": "stemmer",
          "language": "spanish"
        }
      },
      "analyzer": {
        "default_search": {
          "type":"spanish",
          "filter": [
            "lowercase",
            "spanish_stop",
            "spanish_stemmer"
          ]
        }
      }
    }
  },
  "mappings":{
    "properties":{
      "fullname":{
        "type":"text",
        "analyzer":"default_search"
      }
    }
  }
}

Then I ran the following analyze request:

POST /test/_analyze
{
  "analyzer": "default_search",
  "text": "polera color rojo"
}

What I received as a response was the following:

{
  "tokens" : [
    {
      "token" : "poler",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "color",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "rojo",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

As you can see, "polera" was correctly stemmed to "poler", but "rojo" was not stemmed. I also tried other colors and nouns and added more text, but the issue seems to be specific to "rojo".

I reproduced the problem on both an Elasticsearch instance in AWS and a local one. Plural forms such as "rojas" and "rojos" do work and are stemmed to "roj".

Did I configure something wrong, or is this actually an issue with Spanish stemming in Elasticsearch?

EDIT: The issue may be related to word length. The same thing happens with "coma" and "como", which should be stemmed to "com" but are left unchanged. "comas", on the other hand, is stemmed to "com".


Solution

  • It seems the stemmer being applied has a minimum token length: I tried "rojos" instead of "rojo" and it stems to "roj".

    You can try another approach, such as the Snowball stemmer filter:

    PUT /test_spanish
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "standard",
              "filter": [ "lowercase", "my_snow" ]
            }
          },
          "filter": {
            "my_snow": {
              "type": "snowball",
              "language": "Spanish"
            }
          }
        }
      }
    }
    
    POST /test_spanish/_analyze
    {
      "analyzer": "my_analyzer",
      "text": "polera color rojo"
    }
    
    {
      "tokens" : [
        {
          "token" : "poler",
          "start_offset" : 0,
          "end_offset" : 6,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "color",
          "start_offset" : 7,
          "end_offset" : 12,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "roj",
          "start_offset" : 13,
          "end_offset" : 17,
          "type" : "<ALPHANUM>",
          "position" : 2
        }
      ]
    }
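
    Another option, sketched here under assumptions, is to keep the question's own filters but declare the analyzer as custom. The Elasticsearch documentation describes the built-in `spanish` analyzer as using the `light_spanish` stemmer, and my assumption is that with "type": "spanish" the "filter" list in the question is not applied, which would explain why short words such as "rojo" stay unchanged. The index name `test_custom` and the analyzer name `spanish_custom` are hypothetical, and the `stemmer` filter with "language": "spanish" is assumed to be the Snowball-based Spanish stemmer.

    ```console
    # Hypothetical index name; the filters are the ones from the question.
    # Assumption: "type": "custom" with an explicit tokenizer makes the filter chain take effect.
    PUT /test_custom
    {
      "settings": {
        "analysis": {
          "filter": {
            "spanish_stop": { "type": "stop", "stopwords": "_spanish_" },
            "spanish_stemmer": { "type": "stemmer", "language": "spanish" }
          },
          "analyzer": {
            "spanish_custom": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [ "lowercase", "spanish_stop", "spanish_stemmer" ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "fullname": { "type": "text", "analyzer": "spanish_custom" }
        }
      }
    }

    # Same analyze request as in the question, against the hypothetical index.
    POST /test_custom/_analyze
    {
      "analyzer": "spanish_custom",
      "text": "polera color rojo"
    }
    ```

    If the assumption holds, the analyze request should return "roj" for "rojo", matching the Snowball output above.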