I have been using elasticsearch-analysis-kuromoji to search Japanese text, but I am seeing two very strange behaviours. First, a query such as '輸出貿易' returns nothing unless I pass it as '輸 出 貿 易', with a space between each character. Second, character sequences such as ント are not matched at all.
This is my configuration:
.setSettings(ImmutableSettings.settingsBuilder().loadFromSource(jsonBuilder()
    .startObject()
        .startObject("analysis")
            // Custom tokenizer built on kuromoji_tokenizer
            .startObject("tokenizer")
                .startObject("kuromoji_user_dict")
                    .field("type", "kuromoji_tokenizer")
                    .field("mode", "extended")
                    .field("discard_punctuation", "false")
                .endObject()
            .endObject()
            // Custom analyzer that uses the tokenizer defined above
            .startObject("analyzer")
                .startObject(JAPANESE_LANGUAGE_ANALYSIS)
                    .field("type", "custom")
                    .field("tokenizer", "kuromoji_user_dict")
                .endObject()
            .endObject()
        .endObject()
    .endObject().string()));
Am I configuring it wrong, or do I need a different tokeniser for text like '輸出貿易' and 'ント'?
Thank you.
After some online research and some help from the elasticsearch-analysis-kuromoji team, I was able to find the problem. Even though I had created the analyzer and told the query to use it, I also needed to set it on the field in the mapping, like so:
XContentBuilder xbMapping =
    jsonBuilder()
        .startObject()
            .startObject(indexType)
                .startObject("properties")
                    .startObject("source")
                        .field("type", "string")
                    .endObject()
                    // Analyze the "text" field with the custom kuromoji-based analyzer
                    .startObject("text")
                        .field("type", "string")
                        .field("analyzer", JAPANESE_LANGUAGE_ANALYSIS)
                    .endObject()
                .endObject()
            .endObject()
        .endObject();
elasticSearchClient.admin().indices()
    .preparePutMapping(indexName)
    .setType(indexType)
    .setSource(xbMapping)
    .execute().get();
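A plausible reading of the original symptoms is that, without the mapping, the field was indexed with the default analyzer while the query was analyzed with the kuromoji one, so the two sides produced different terms. The toy sketch below only illustrates that mismatch and does not call Elasticsearch or kuromoji. The class and method names (AnalyzerMismatchDemo, perCharacterTerms, wordTerms, matchesAll) are hypothetical, the one-term-per-character behaviour stands in for the default handling of kanji, and the split of 輸出貿易 into 輸出 and 貿易 is a hard-coded assumption rather than real tokenizer output.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalyzerMismatchDemo {

    // Stand-in for default analysis of kanji: one term per non-space character.
    static List<String> perCharacterTerms(String text) {
        List<String> terms = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> terms.add(new String(Character.toChars(cp))));
        return terms;
    }

    // Stand-in for a word-based tokenizer. The segmentation is hard-coded
    // (assumption); a real tokenizer would use a dictionary.
    static List<String> wordTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String chunk : text.trim().split("\\s+")) {
            if (chunk.equals("輸出貿易")) {
                terms.add("輸出");
                terms.add("貿易");
            } else if (!chunk.isEmpty()) {
                terms.add(chunk);
            }
        }
        return terms;
    }

    // A document matches when every query term is among its indexed terms.
    static boolean matchesAll(Set<String> indexedTerms, List<String> queryTerms) {
        return !queryTerms.isEmpty() && indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        String document = "輸出貿易";

        // No mapping: indexed per character, queried with word terms.
        Set<String> indexedByDefault = new HashSet<>(perCharacterTerms(document));
        System.out.println(matchesAll(indexedByDefault, wordTerms("輸出貿易")));
        System.out.println(matchesAll(indexedByDefault, wordTerms("輸 出 貿 易")));

        // With mapping: the same tokenizer is used on both sides.
        Set<String> indexedWithMapping = new HashSet<>(wordTerms(document));
        System.out.println(matchesAll(indexedWithMapping, wordTerms("輸出貿易")));
    }
}
```

In this sketch the unspaced query only matches once the index side uses the same tokenizer as the query side, which is what setting the analyzer on the "text" field in the mapping achieves.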