In one of my projects, I am trying to use Elasticsearch (1.7) to query data, but it returns different results for Unicode characters depending on whether they are uppercased or not. I tried to use the ICU analysis plugin to get rid of the problem.
Here is a small example to demonstrate my problem. My index looks like this:
$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "analysis": {
        "filter": {
          "nfkc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfkc"
          }
        },
        "analyzer": {
          "my_lowercaser": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfkc_normalizer"
            ]
          }
        }
      }
    }
  }
}'
Here is a test document to demonstrate my problem:
$ curl -X POST http://10.22.20.140:9200/tr-test/_bulk -d '
{"index": {"_type":"names", "_index":"tr-test"}}
{"name":"BAHADIR"}'
Here is a sample query. If I query using BAHADIR as the query_string, I can easily find my test data.
$ curl -X POST http://10.22.20.140:9200/tr-test/_search -d '
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "BAHADIR"
        }
      }
    }
  }
}'
In Turkish, the lowercased version of BAHADIR is bahadır. I expect the same result when querying with bahadır, but Elasticsearch cannot find my data, and I cannot fix that by using ICU for analysis. It works perfectly fine if I query with bahadir.
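To make the mismatch concrete outside Elasticsearch, here is a minimal Python sketch comparing locale-independent lowercasing with the Turkish rule (dotless I becomes ı, dotted İ becomes i). The helper `turkish_lower` is hypothetical and written only for illustration; it is not what Elasticsearch runs internally.

```python
def turkish_lower(text):
    """Lowercase text using the Turkish I/ı and İ/i mapping (illustrative helper)."""
    # Handle the two Turkish-specific capitals first, then lowercase the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


# Default (locale-independent) lowercasing maps I to a dotted i.
default_lowered = "BAHADIR".lower()
# Turkish lowercasing maps I to a dotless ı.
turkish_lowered = turkish_lower("BAHADIR")

print(default_lowered)                     # bahadir
print(turkish_lowered)                     # bahadır
print(default_lowered == turkish_lowered)  # False
```

The two results differ in a single character, which is enough for an exact term lookup to miss.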
I have already read Living in a Unicode World and Unicode Case Folding, but I still cannot make Elasticsearch use the correct case folding.
I also tried creating my index like this:
$ curl -X PUT http://localhost:9200/tr-test -d '
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1"
    }
  }
}'
But I get the same results. My data can be found if I search using BAHADIR or bahadir, but it cannot be found by searching for bahadır, which is the correct lowercased version of BAHADIR.
You should try using the Turkish language analyzer in your mapping:
{
  "mappings": {
    "names": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "turkish"
        }
      }
    }
  }
}
As you can see in the implementation details, it also defines a turkish_lowercase filter, so I'd expect it to take care of this problem for you. If you don't want all the other features of the Turkish analyzer, define a custom one with only turkish_lowercase.
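As a rough sketch of that custom-analyzer option, the Python snippet below builds an index body whose analyzer applies only a Turkish lowercase filter. The analyzer name `turkish_lowercase_only`, the choice of the `standard` tokenizer, and the helper function are assumptions for illustration; the filter definition (`"type": "lowercase"` with `"language": "turkish"`) follows the form shown in the Turkish analyzer's documentation. I have not run this against a 1.7 cluster.

```python
import json


def turkish_lowercase_index_body(doc_type="names", field="name"):
    """Build an index body with a custom analyzer that only lowercases Turkish text."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Lowercase filter configured for Turkish (I -> ı, İ -> i).
                    "turkish_lowercase": {"type": "lowercase", "language": "turkish"}
                },
                "analyzer": {
                    # Hypothetical analyzer name; tokenizer choice is an assumption.
                    "turkish_lowercase_only": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["turkish_lowercase"],
                    }
                },
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "string", "analyzer": "turkish_lowercase_only"}
                }
            }
        },
    }


body = turkish_lowercase_index_body()
# Print the JSON so it can be pasted into a curl -X PUT request.
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed JSON can be used as the request body when creating the index, in place of the mapping above.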
If you need full-text search on your name field, you should also change the query to a match query, which is the basic full-text search method on a single field.
{
  "query": {
    "match": {
      "name": "bahadır"
    }
  }
}
On the other hand, the query string query is more complex: it searches multiple fields (the _all field by default) and allows an advanced syntax. It also has an option to pass the analyzer you want to use, so if you really need this kind of query, try passing "analyzer": "turkish" within the query. I'm not an expert on the query string query, though.
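For completeness, here is a minimal sketch of what that query string variant might look like, built in Python so the body can be printed and pasted into a curl request. The helper name is hypothetical; `analyzer` and `default_field` are parameters of the query string query, but whether this resolves the issue on your index is untested.

```python
import json


def query_string_body(text, analyzer="turkish", field="name"):
    """Build a query_string search body that names the analyzer and field explicitly."""
    return {
        "query": {
            "query_string": {
                "query": text,
                # Search the mapped field instead of the default _all field.
                "default_field": field,
                # Analyze the query text with the Turkish analyzer.
                "analyzer": analyzer,
            }
        }
    }


search_body = query_string_body("bahadır")
print(json.dumps(search_body, indent=2, ensure_ascii=False))
```

Setting the field explicitly keeps the query from falling back to _all, which is not analyzed with the analyzer configured on name.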