
Correctly folding ASCII characters in Elasticsearch


I'm looking into folding non-ASCII characters to their ASCII equivalents, as this guide suggests:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "folding": {
          "tokenizer": "standard",
          "filter":  [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

Strangely enough, I can't reproduce the guide's example with this analyzer.

When I execute

GET /my_index/_analyze?analyzer=folding&text=My œsophagus caused a débâcle

the following tokens are returned:

sophagus, caused, a, d, b, cle

What I want to achieve is:

Variations of the spelling of words like "école" (e.g. ecole, ècole) should be treated as the same word.

Right now, if I execute

GET /my_index/_analyze?analyzer=folding&text=école ecole

I get the tokens cole, ecole.

These are the settings I currently use for the text analysis of the documents:

"analysis": {
  "filter": {
    "french_stop": {
      "type": "stop",
      "stopwords": "_french_"
    },
    "french_elision": {
      "type": "elision",
      "articles": [
        "l",
        "m",
        "t",
        "qu",
        "n",
        "s",
        "j",
        "d",
        "c",
        "jusqu",
        "quoiqu",
        "lorsqu",
        "puisqu"
      ]
    },
    "french_stemmer": {
      "type": "stemmer",
      "language": "light_french"
    }
  },
  "analyzer": {
    "index_French": {
      "filter": [
        "french_elision",
        "lowercase",
        "french_stop",
        "french_stemmer"
      ],
      "char_filter": [
        "html_strip"
      ],
      "type": "custom",
      "tokenizer": "standard"
    },
    "sort_analyzer": {
      "type": "custom",
      "filter": [
        "lowercase"
      ],
      "tokenizer": "keyword"
    }
  }
}

My idea is to change the filter list of the index_French analyzer to the following:

"filter": ["french_elision","lowercase","asciifolding","french_stop","french_stemmer"]

Thanks for your help.


Solution

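  • In Sense, send the text in the request body of the _analyze endpoint instead of in the query string, and it will work. The tokens in the question (d, b, cle) suggest that the non-ASCII characters were lost before they ever reached the analyzer, which points to an encoding problem with the URL rather than a problem with the analyzer itself. The index is named foldings here; substitute your own index name: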

    POST /foldings/_analyze
    {
       "text": "My œsophagus caused a débâcle",
       "analyzer": "folding"
    }
    

    You'll get:

    {
       "tokens": [
          {
             "token": "my",
             "start_offset": 0,
             "end_offset": 2,
             "type": "<ALPHANUM>",
             "position": 0
          },
          {
             "token": "oesophagus",
             "start_offset": 3,
             "end_offset": 12,
             "type": "<ALPHANUM>",
             "position": 1
          },
          {
             "token": "caused",
             "start_offset": 13,
             "end_offset": 19,
             "type": "<ALPHANUM>",
             "position": 2
          },
          {
             "token": "a",
             "start_offset": 20,
             "end_offset": 21,
             "type": "<ALPHANUM>",
             "position": 3
          },
          {
             "token": "debacle",
             "start_offset": 22,
             "end_offset": 29,
             "type": "<ALPHANUM>",
             "position": 4
          }
       ]
    }
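
For the French analyzer in the question, the proposed change would look roughly like the sketch below. Only the index_French entry is shown; the filter definitions from the question stay as they are. The position of asciifolding (right after lowercase) is simply the order the question proposes, not a verified recommendation.

```json
{
  "analysis": {
    "analyzer": {
      "index_French": {
        "type": "custom",
        "char_filter": ["html_strip"],
        "tokenizer": "standard",
        "filter": [
          "french_elision",
          "lowercase",
          "asciifolding",
          "french_stop",
          "french_stemmer"
        ]
      }
    }
  }
}
```

To check it, send a body such as { "text": "école ecole", "analyzer": "index_French" } to the _analyze endpoint of the index that holds these settings, in the same way as above. Once the accent is folded, both words go through identical filters and should come back as the same token. Documents indexed before the change keep their old tokens until they are reindexed.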