Tags: elasticsearch, elasticsearch-5, elasticsearch-dsl

Elasticsearch: how to use a language analyzer with a UTF-8 filter?


I have a problem with the Elasticsearch language analyzer. I am working with the Lithuanian language, so I am using the Lithuanian language analyzer. The analyzer works fine and I get all the word cases I need. For example, I index the Lithuanian city "Klaipėda":

PUT /cities/city/1
{
  "name": "Klaipėda"
}

The problem is that I also need to get a result when I search for "Klaipėda" written only in the Latin alphabet ("Klaipeda"), in all Lithuanian cases:

  1. Nominative case: "Klaipeda"
  2. Genitive case: "Klaipedos"
  3. ...
  4. Locative case: "Klaipedoje"

"Klaipėda", "Klaipėdos" and "Klaipėdoje" work, but "Klaipeda", "Klaipedos" and "Klaipedoje" do not.

My index:

PUT /cities
{
  "mappings": {
    "city": {
      "properties": {
        "name": {
          "type":     "string",
          "analyzer": "lithuanian",
          "fields": {
            "folded": {
              "type":     "string",
              "analyzer": "md_folded_analyzer"
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "md_folded_analyzer": {
          "type": "lithuanian",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "lithuanian_stop",
            "lithuanian_keywords",
            "lithuanian_stemmer"
          ]
        }
      }
    }
  }
}

and search query:

GET /cities/_search
{
  "query": {
    "multi_match" : {
      "type":     "most_fields",
      "query":    "klaipeda", 
      "fields": [ "name", "name.folded" ]
    }
  }
}

What am I doing wrong? Thanks for your help.


Solution

  • The technique you are using here is called multi-fields. As far as I can tell, the limitation of the name.folded sub-field is that you can't search against it; you can only sort and aggregate on name.folded.

    To work around this, I came up with the following set-up:

    1. Set up separate fields (to avoid supplying the value twice, just specify copy_to):

      curl -XPUT http://localhost:9200/cities -d '
      {
        "mappings": {
          "city": {
            "properties": {
              "name": {
                "type":     "string",
                "analyzer": "lithuanian",
                "copy_to":  "folded"
              },
              "folded": {
                "type":     "string",
                "analyzer": "md_folded_analyzer"
              }
            }
          }
        }
      }'
      
    2. Change the type of your analyzer to custom, because with the type set to lithuanian the tokenizer and filter settings are ignored and asciifolding never makes it into the analysis chain. More importantly, asciifolding should come after the Lithuanian stop-word and stemming filters, because once the diacritics have been folded away those filters may no longer recognize the word.

      curl -XPUT http://localhost:9200/my_cities -d '
      {
        "settings": {
            "analysis": {
              "filter": {
                "lithuanian_stop": {
                  "type":       "stop",
                  "stopwords":  "_lithuanian_"
                },
                "lithuanian_stemmer": {
                  "type":       "stemmer",
                  "language":   "lithuanian"
                }
              },
              "analyzer": {
                "md_folded_analyzer": {
                  "type": "custom",
                  "tokenizer": "standard",
                  "filter":  [
                    "lowercase",
                    "lithuanian_stop",
                    "lithuanian_stemmer",
                    "asciifolding"
                  ]
                }
              }
           }
        }
      }'
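
    To check what the analyzer actually emits, the _analyze API can be run against the index from the request above. This is only a sketch: it assumes the my_cities index exists on localhost:9200, the sample words are the ones from the question, and the Content-Type header is my addition (newer Elasticsearch versions require it). If the folding works, the accented and unaccented forms should come out as the same tokens, but verify that on your own cluster.

      # Assumption: my_cities was created with the md_folded_analyzer settings above.
      curl -XPOST 'http://localhost:9200/my_cities/_analyze?pretty' -H 'Content-Type: application/json' -d '
      {
        "analyzer": "md_folded_analyzer",
        "text": "Klaipėda Klaipėdoje Klaipeda Klaipedoje"
      }'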
      

      Sorry, I've left out lithuanian_keywords: it requires additional set-up, which I skipped here. But I hope you get the idea.
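
    One thing the two requests above leave implicit: they create different indices (cities and my_cities), while a mapping can only reference an analyzer defined in the same index. A minimal sketch that puts both parts into a single index-creation request might look like this. It assumes Elasticsearch 5.x on localhost:9200 and a fresh cities index (delete the old one first), keeps the "string" type from the question, and adds a Content-Type header that newer versions require.

      # Assumption: no index named cities exists yet; settings and mappings go in one request.
      curl -XPUT 'http://localhost:9200/cities' -H 'Content-Type: application/json' -d '
      {
        "settings": {
          "analysis": {
            "filter": {
              "lithuanian_stop":    { "type": "stop",    "stopwords": "_lithuanian_" },
              "lithuanian_stemmer": { "type": "stemmer", "language":  "lithuanian" }
            },
            "analyzer": {
              "md_folded_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [ "lowercase", "lithuanian_stop", "lithuanian_stemmer", "asciifolding" ]
              }
            }
          }
        },
        "mappings": {
          "city": {
            "properties": {
              "name":   { "type": "string", "analyzer": "lithuanian", "copy_to": "folded" },
              "folded": { "type": "string", "analyzer": "md_folded_analyzer" }
            }
          }
        }
      }'

    After the document has been indexed again, the query from the question would then target the new top-level field instead of name.folded:

      # Same multi_match as in the question, with "folded" replacing "name.folded".
      curl -XGET 'http://localhost:9200/cities/_search?pretty' -H 'Content-Type: application/json' -d '
      {
        "query": {
          "multi_match": {
            "type":   "most_fields",
            "query":  "klaipeda",
            "fields": [ "name", "folded" ]
          }
        }
      }'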