
Behavior of elasticsearch-analysis-kuromoji is not what I expected


I have been using elasticsearch-analysis-kuromoji to search Japanese text, but I am seeing two strange behaviours. First, a query such as '輸出貿易' returns nothing unless I pass it as '輸 出 貿 易', with a space between each character. Second, strings such as 'ント' are not matched at all.

This is my configuration:

            .setSettings(ImmutableSettings.settingsBuilder().loadFromSource(jsonBuilder()
                    .startObject()
                        .startObject("analysis")
                            // Custom tokenizer built on kuromoji_tokenizer
                            .startObject("tokenizer")
                                .startObject("kuromoji_user_dict")
                                    .field("type", "kuromoji_tokenizer")
                                    .field("mode", "extended")
                                    .field("discard_punctuation", "false")
                                .endObject()
                            .endObject()
                            // Custom analyzer that uses the tokenizer defined above
                            .startObject("analyzer")
                                .startObject(JAPANESE_LANGUAGE_ANALYSIS)
                                    .field("type", "custom")
                                    .field("tokenizer", "kuromoji_user_dict")
                                .endObject()
                            .endObject()
                        .endObject()
                    .endObject().string()));

Am I configuring it wrong, or do I need a different tokenizer for strings like '輸出貿易' and 'ント'?

Thank you.


Solution

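  • After some online research and some help from the elasticsearch-analysis-kuromoji team, I found the problem. Even though I had created the analyzer and told the query to use it, the field itself was never mapped to that analyzer, so presumably the documents were indexed with the default analyzer and the index-time tokens did not line up with the tokens kuromoji produced at query time. I also needed to add a mapping, like so: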

    XContentBuilder xbMapping =
            jsonBuilder()
                    .startObject()
                    .startObject(indexType)
                    .startObject("properties")
                    .startObject("source")
                    .field("type", "string")
                    .endObject()
                    .startObject("text")
                    .field("type", "string")
                    .field("analyzer", JAPANESE_LANGUAGE_ANALYSIS)
                    .endObject()
                    .endObject()
                    .endObject()
                    .endObject();
    
    elasticSearchClient.admin().indices()
            .preparePutMapping(indexName)
            .setType(indexType)
            .setSource(xbMapping)
            .execute().get();
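
To illustrate why the mapping matters, here is a toy sketch of the mismatch in plain Java. It does not call Elasticsearch or kuromoji. `perCharacterTokens` and `dictionaryTokens` are hypothetical stand-ins, and the sketch assumes the unmapped field fell back to an analyzer that emits each kanji as its own token, which would be consistent with '輸 出 貿 易' matching while '輸出貿易' did not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnalyzerMismatchDemo {

    // Hypothetical stand-in for a default analyzer: one token per
    // non-whitespace character (an assumption, not the real implementation).
    public static List<String> perCharacterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!Character.isWhitespace(cp)) {
                tokens.add(new String(Character.toChars(cp)));
            }
            i += Character.charCount(cp);
        }
        return tokens;
    }

    // Hypothetical stand-in for a dictionary-based tokenizer: greedy longest
    // match against a hand-written word list, single characters otherwise.
    public static List<String> dictionaryTokens(String text, List<String> dictionary) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String best = null;
            for (String word : dictionary) {
                boolean longer = best == null || word.length() > best.length();
                if (!word.isEmpty() && text.startsWith(word, i) && longer) {
                    best = word;
                }
            }
            if (best == null) {
                best = new String(Character.toChars(text.codePointAt(i)));
            }
            if (!best.trim().isEmpty()) {
                tokens.add(best);
            }
            i += best.length();
        }
        return tokens;
    }

    // Simplified matching rule: every query token must exist in the index.
    public static boolean matches(List<String> indexed, List<String> query) {
        return indexed.containsAll(query);
    }

    public static void main(String[] args) {
        String doc = "輸出貿易";
        List<String> dict = Arrays.asList("輸出", "貿易");

        List<String> indexedWithoutMapping = perCharacterTokens(doc);
        List<String> indexedWithMapping = dictionaryTokens(doc, dict);
        List<String> query = dictionaryTokens(doc, dict);

        // Index and query tokenized differently: no match.
        System.out.println(matches(indexedWithoutMapping, query));
        // Spaced-out query happens to reproduce the per-character tokens.
        System.out.println(matches(indexedWithoutMapping, perCharacterTokens("輸 出 貿 易")));
        // Same tokenizer on both sides, as after adding the mapping: match.
        System.out.println(matches(indexedWithMapping, query));
    }
}
```

In this model the first check fails and the other two succeed, which mirrors the behaviour in the question. Real analyzers are more involved, but the point carries over: the field has to be analyzed the same way at index time and at query time, and the mapping above is what ties the "text" field to the kuromoji-based analyzer.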