
Elasticsearch uses wrong Case Folding for Unicode Characters


In one of my projects, I am trying to use Elasticsearch (1.7) to query data, but it returns different results for Unicode characters depending on whether they are uppercase or lowercase. I tried to use icu_analyzer to get rid of the problem.

Here is a small example to demonstrate my problem. My index looks like this:

$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "analysis": {
        "filter": {
          "nfkc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfkc"
          }
        },
        "analyzer": {
          "my_lowercaser": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfkc_normalizer"
            ]
          }
        }
      }
    }
  }
}'

Here is some test data:

$ curl -X POST http://10.22.20.140:9200/tr-test/_bulk -d '
{"index": {"_type":"names", "_index":"tr-test"}}
{"name":"BAHADIR"}'

Here is the kind of query I run. If I use BAHADIR as the query_string, I can easily find my test data.

$ curl -X POST http://10.22.20.140:9200/tr-test/_search -d '
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "BAHADIR"
        }
      }
    }
  }
}'

In Turkish, the lowercase form of BAHADIR is bahadır (with a dotless ı). I expect the same result when querying with bahadır, but Elasticsearch cannot find my data, and using ICU for analysis does not fix that. It works perfectly fine if I query with bahadir.
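To make the mismatch concrete, here is a minimal Python sketch (not Elasticsearch code) contrasting default Unicode lowercasing with Turkish lowercasing. The helper `turkish_lower` is a hypothetical name introduced for illustration; it only handles the two Turkish-specific letters and is not a full locale-aware implementation.

```python
# Default (locale-independent) Unicode lowercasing maps "I" to "i".
default_lower = "BAHADIR".lower()


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: I -> ı (dotless), İ -> i (dotted)."""
    # Map the two Turkish-specific capitals first, then lowercase the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()


print(default_lower)             # bahadir
print(turkish_lower("BAHADIR"))  # bahadır
```

A token indexed as bahadir and a query token bahadır are different strings, so they only match if the same Turkish-aware lowercasing is applied both at index time and at query time.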

I have already read Living in a Unicode World and Unicode Case Folding, but they did not solve my problem. I still cannot make Elasticsearch use the correct case folding.

Update

I also tried creating my index like this:

$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer" : "turkish"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  }
}'

But I get the same results. My data can be found if I search for BAHADIR or bahadir, but not if I search for bahadır, which is the correct lowercase form of BAHADIR.


Solution

  • Try using the Turkish language analyzer in your mapping:

    {
      "mappings": {
        "names": {
          "properties": {
            "name": {
              "type":     "string",
              "analyzer": "turkish" 
            }
          }
        }
      }
    }
    

    As you can see in the implementation details, it defines a turkish_lowercase token filter, so I expect it to take care of your problem. If you don't want the other features of the Turkish analyzer (such as stop words and stemming), define a custom analyzer that uses only turkish_lowercase.
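A custom analyzer with only Turkish lowercasing might look like the sketch below, written in Python so the request body can be built and checked before sending it. The analyzer name `turkish_lowercase_only` and the choice of the `standard` tokenizer are assumptions made for illustration. The `turkish_lowercase` filter definition mirrors the one documented for the built-in Turkish analyzer.

```python
import json

# Index body for Elasticsearch 1.x. "turkish_lowercase_only" is a made-up name.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                # Lowercase filter configured with Turkish rules (I -> ı).
                "turkish_lowercase": {"type": "lowercase", "language": "turkish"}
            },
            "analyzer": {
                "turkish_lowercase_only": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["turkish_lowercase"],
                }
            },
        }
    },
    "mappings": {
        "names": {
            "properties": {
                "name": {"type": "string", "analyzer": "turkish_lowercase_only"}
            }
        }
    },
}

# Print the JSON to use as the body of: curl -X PUT http://localhost:9200/tr-test -d '...'
print(json.dumps(index_body, indent=2))
```

Because the analyzer is set on the name field itself, it applies both when documents are indexed and when a match query on name analyzes the search text.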

    If you need full-text search on your name field, you should also switch to the match query, which is the basic full-text query on a single field and analyzes the search text with that field's analyzer.

    {
      "query": {
        "match": {
          "name": "bahadır"
        }
      }
    }
    

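    The query string query, on the other hand, is more complex: it supports an advanced syntax and can search multiple fields. In Elasticsearch 1.x, when no default_field is given it searches the _all field, which is analyzed with the default analyzer rather than turkish; that is probably why bahadır still finds nothing after the mapping change. It also has an option to pass the analyzer you want to use, so if you really need this kind of query, try passing "analyzer": "turkish" within the query. I'm not an expert on the query string query, though.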
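If you do want to stay with the query string query, a request body along these lines is one option. It is a sketch, not a tested fix: `turkish_query_string` is a hypothetical helper, and setting `default_field` to name is an added assumption so that the query targets the field mapped with the turkish analyzer instead of _all.

```python
import json


def turkish_query_string(text: str, field: str = "name") -> dict:
    """Build a query_string body that analyzes the query text with 'turkish'."""
    return {
        "query": {
            "query_string": {
                "default_field": field,   # search this field instead of _all
                "analyzer": "turkish",    # analyzer applied to the query text
                "query": text,
            }
        }
    }


body = turkish_query_string("bahadır")
# ensure_ascii=False keeps the dotless ı readable in the printed JSON.
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be posted to the _search endpoint in the same way as the other queries in this thread.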