
Difference in handling possessives (apostrophes) with the english stemmer between Elasticsearch 1.2 and 1.4


We have two Elasticsearch instances, one running 1.2.1 and one running 1.4. The settings and mappings are identical on the indices on both instances, yet the results differ.

The settings for the default analyzer:

....
analysis: {
 filter: {
  ourEnglishStopWords: {
   type: "stop",
   stopwords: "_english_"
  },
  ourEnglishFilter: {
   type: "stemmer",
   name: "english"
  }
 },
 analyzer: {
  default: {
   filter: [
    "asciifolding",
    "lowercase",
    "ourEnglishStopWords",
    "ourEnglishFilter"
   ],
   tokenizer: "standard"
  }
 }
},
...

The difference between the two versions shows up when indexing and searching possessive forms. In 1.2.1, "player", "players" and "player's" all return the same results. In 1.4, "player" and "players" return identical result sets, but "player's" does not match that set. Is this a known difference? What is the right way to get the 1.2.1 behavior in 1.4 and up?


Solution

  • I think this is the change, introduced in 1.3.0:

    The StemmerTokenFilter had a number of issues:

    1. english returned the slow snowball English stemmer
    2. porter2 returned the snowball Porter stemmer (v1)

    Changes:

    1. english now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
    2. porter2 now returns the snowball English stemmer (for indices created from v1.3.0 onwards)

    According to that GitHub issue, you can either change your filter definition to:

        "ourEnglishFilter": {
          "type": "stemmer",
          "name": "porter2"
        }
    

    or keep the english stemmer and add a possessive_english stemmer filter ahead of it in the chain:

        "filter": {
          "ourEnglishStopWords": {
            "type": "stop",
            "stopwords": "_english_"
          },
          "ourEnglishFilter": {
            "type": "stemmer",
            "name": "english"
          },
          "possesiveEnglish": {
            "type": "stemmer",
            "name": "possessive_english"
          }
        },
        "analyzer": {
          "default": {
            "filter": [
              "asciifolding",
              "lowercase",
              "ourEnglishStopWords",
              "possesiveEnglish",
              "ourEnglishFilter"
            ],
            "tokenizer": "standard"
          }
        }
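
To illustrate why placing a possessive filter before the stemmer should restore the old matching, here is a rough Python sketch of what such a filter does to each token. It is not Elasticsearch or Lucene code: `strip_possessive` is a hypothetical helper written for this example, and the set of apostrophe characters is an assumption modeled on Lucene's EnglishPossessiveFilter.

```python
# Apostrophe characters treated as possessive markers.
# Assumption: mirrors Lucene's EnglishPossessiveFilter (ASCII ', U+2019, U+FF07).
APOSTROPHES = {"'", "\u2019", "\uff07"}


def strip_possessive(token: str) -> str:
    """Remove a trailing 's (or 'S) from a single token; leave other tokens alone."""
    if len(token) >= 2 and token[-1] in ("s", "S") and token[-2] in APOSTROPHES:
        return token[:-2]
    return token


if __name__ == "__main__":
    # The three forms from the question, as the standard tokenizer would emit them.
    for token in ["player", "players", "player's"]:
        print(token, "->", strip_possessive(token))
```

With the possessive removed first, "player's" reaches the stemmer as "player", the same input as the plain form, while "players" is left for the stemmer to reduce.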