I'm using Elasticsearch 0.90.1 with the Kuromoji plugin 1.4.0.
$ curl localhost:9200
{
  "ok" : true,
  "status" : 200,
  "name" : "Agent Zero",
  "version" : {
    "number" : "0.90.1",
    "snapshot_build" : false,
    "lucene_version" : "4.3"
  },
  "tagline" : "You Know, for Search"
}
I create a new index, using Kuromoji as my default analyzer:
$ curl -X PUT localhost:9200/test -d '{
  "index": {
    "analysis": {
      "filter": {
        "kuromoji_rf": {
          "type": "kuromoji_readingform",
          "use_romaji": "false"
        }
      },
      "tokenizer": {
        "kuromoji": {
          "type": "kuromoji_tokenizer"
        }
      },
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "kuromoji",
          "filter": [
            "kuromoji_rf"
          ]
        }
      }
    }
  }
}'
result:
{
  "ok": true,
  "acknowledged": true
}
The reading form token filter seems to be working fine (kanji is normalized to katakana):
$ curl localhost:9200/test/_analyze -d '東京'
result:
{
  "tokens": [
    {
      "token": "トウキョウ",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}
Index a document:
$ curl -X PUT localhost:9200/test/docs/1 -d '{
  "body": "これは関西国際空港です"
}'
result:
{
  "ok": true,
  "_index": "test",
  "_type": "docs",
  "_id": "1",
  "_version": 1
}
The indexed document matches a wildcard query:
$ curl 'localhost:9200/test/docs/_search?q=body:*'
result:
{
  "took": 109,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "docs",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "body": "これは関西国際空港です"
        }
      }
    ]
  }
}
However, it doesn't match when I search using Japanese:
$ curl 'localhost:9200/test/docs/_search?q=body:空港'
result:
{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
$ curl 'localhost:9200/test/docs/_search?q=body:クウコウ'
result:
{
  "took": 95,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
$ curl 'localhost:9200/test/docs/_search?q=body:空'
result:
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
I wondered if maybe the analyzer was not being used for the search query, but specifying the analyzer does not help:
$ curl 'localhost:9200/test/docs/_search?analyzer=default&q=body:空港'
result:
{
  "took": 17,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
By the way, everything works fine if I disable the token filter.
What am I doing wrong?
Your URL (e.g. localhost:9200/test/docs/_search?q=body:クウコウ) is probably not URL-encoded. I tried the following command with the query term percent-encoded, and it returns results:

"クウコウ" -> "%E3%82%AF%E3%82%A6%E3%82%B3%E3%82%A6"

$ curl 'http://localhost:9200/test/docs/_search?q=body:%E3%82%AF%E3%82%A6%E3%82%B3%E3%82%A6'
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.11506981,
    "hits": [
      {
        "_index": "test",
        "_type": "docs",
        "_id": "1",
        "_score": 0.11506981,
        "_source": {
          "body": "これは関西国際空港です"
        }
      }
    ]
  }
}
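In case it helps, here is a small sketch of how that percent-encoded string can be derived. The Japanese characters are encoded as UTF-8 bytes and each byte becomes a `%XX` escape; Python's standard-library `urllib.parse.quote` does exactly this (the function and its default UTF-8 behavior are real; the variable names are just for illustration):

```python
# Derive the percent-encoded form of a Japanese query term.
# urllib.parse.quote encodes non-ASCII text as UTF-8 percent-escapes by default.
from urllib.parse import quote

term = "クウコウ"
encoded = quote(term)
print(encoded)              # %E3%82%AF%E3%82%A6%E3%82%B3%E3%82%A6
print(f"q=body:{encoded}")  # q=body:%E3%82%AF%E3%82%A6%E3%82%B3%E3%82%A6
```

Alternatively, reasonably recent curl versions can do the encoding themselves with `-G --data-urlencode 'q=body:クウコウ'`, which builds the same percent-encoded query string for you.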