Elasticsearch uses wrong Case Folding for Unicode Characters

In one of my projects, I am trying to use Elasticsearch (1.7) to query data, but it returns different results for Unicode characters depending on whether they are uppercased or not. I tried using icu_analyzer to get rid of the problem.

Here is a small example to demonstrate my problem. My index is created like this:

$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "analysis": {
        "filter": {
          "nfkc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfkc"
          }
        },
        "analyzer": {
          "my_lowercaser": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfkc_normalizer"
            ]
          }
        }
      }
    }
  }
}'

Here is some test data to demonstrate my problem.

$ curl -X POST http://10.22.20.140:9200/tr-test/_bulk -d '
{"index": {"_type":"names", "_index":"tr-test"}}
{"name":"BAHADIR"}'

Here is a similar query. If I query using BAHADIR as the query_string, I can easily find my test data.

$ curl -X POST http://10.22.20.140:9200/tr-test/_search -d '
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "BAHADIR"
        }
      }
    }
  }
}'

In Turkish, the lowercased version of BAHADIR is bahadır. I expect the same result when querying with bahadır, but Elasticsearch cannot find my data, and I cannot fix that by using ICU for analysis. It works perfectly fine if I query with bahadir.

I have already read Living in a Unicode World and Unicode Case Folding, but I could not fix my problem. I still cannot make Elasticsearch use the correct case folding.

Update

I also tried creating my index like this.

$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer" : "turkish"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  }
}'

But I am getting the same results. My data can be found if I search using BAHADIR or bahadir, but it cannot be found by searching for bahadır, which is the correct lowercased version of BAHADIR.

Solution

You should try to use the Turkish Language Analyzer in your mapping.

{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type":     "string",
          "analyzer": "turkish" 
        }
      }
    }
  }
}

As you can see in the implementation details, it also defines a turkish_lowercase filter, so I guess it'll take care of your problem. If you don't want all the other features of the Turkish analyzer, define a custom one with only turkish_lowercase.
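A minimal sketch of such a custom analyzer, assuming the index is created from scratch (delete tr-test first if it already exists). The analyzer name turkish_lowercaser is a hypothetical name chosen for illustration; turkish_lowercase is the lowercase token filter with its language set to turkish, as in the documented breakdown of the turkish analyzer. The second command uses the _analyze API to inspect the tokens that come out.

```shell
# Create the index with a custom analyzer that only tokenizes and applies
# Turkish-aware lowercasing. "turkish_lowercaser" is a made-up name.
curl -X PUT http://localhost:9200/tr-test -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "turkish_lowercase": {
          "type": "lowercase",
          "language": "turkish"
        }
      },
      "analyzer": {
        "turkish_lowercaser": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["turkish_lowercase"]
        }
      }
    }
  },
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish_lowercaser"
        }
      }
    }
  }
}'

# Inspect the tokens the custom analyzer produces for the test value.
curl -X GET 'http://localhost:9200/tr-test/_analyze?analyzer=turkish_lowercaser&pretty' -d 'BAHADIR'
```

If the filter is wired up correctly, the token returned for BAHADIR should be bahadır (with a dotless ı) rather than bahadir.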

If you need full text search on your name field, you should also switch to the match query, which is the basic full text search method on a single field.

{
  "query": {
    "match": {
      "name": "bahadır"
    }
  }
}

On the other hand, the query string query is more complex and searches across multiple fields with an advanced syntax. It also has an option to pass the analyzer you want to use, so if you really need this kind of query, you could try passing "analyzer": "turkish" within the query. I'm not an expert on the query string query, though.
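For reference, a sketch of what that could look like, assuming the name field is mapped with the turkish analyzer as above. As far as I know, in Elasticsearch 1.x a query_string query without a default_field searches the _all field, which is analyzed with the default analyzer rather than the one set on name; that would explain why bahadır is not found even after changing the mapping. The default_field value here is my assumption about which field you want to search.

```shell
# query_string pointed explicitly at the "name" field, with the Turkish
# analyzer applied to the query text (host and index taken from the question).
curl -X POST http://localhost:9200/tr-test/_search -d '
{
  "query": {
    "query_string": {
      "default_field": "name",
      "analyzer": "turkish",
      "query": "bahadır"
    }
  }
}'
```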