I want to use a managed Elasticsearch service, but it does not offer a plugin that I need: the pinyin plugin, which provides a custom tokenizer. My thought is to replicate this tokenization in a preprocessing step before I insert documents into Elasticsearch.
For instance, if I call _analyze?text=%e5%88%98%e5%be%b7%e5%8d%8e&analyzer=pinyin_analyzer (the URL-encoded text is 刘德华), I receive this output:
{
  "tokens": [
    {
      "token": "ldh",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "liu",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "hua",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 4
    }
  ]
}
I have a way to generate tokens like this in a preprocessing step, but is it possible to then insert them pre-analyzed into the Elasticsearch index?
You can index an array of pre-tokenised values; the effect will be the same. Moreover, if you are doing all of the preprocessing yourself and not just the tokenising, map the field as keyword. Otherwise your tokens will be analysed again individually at index time.
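A minimal sketch of what that could look like, assuming Elasticsearch 7+ and placeholder names (my_index for the index and pinyin for the field holding your pre-computed tokens):

Create the index with keyword fields so the stored values are not analysed again:

PUT my_index
{
  "mappings": {
    "properties": {
      "name":   { "type": "keyword" },
      "pinyin": { "type": "keyword" }
    }
  }
}

Index a document with the tokens from your preprocessing step as an array:

PUT my_index/_doc/1
{
  "name": "刘德华",
  "pinyin": ["ldh", "liu", "hua"]
}

Search with a term query, which also skips analysis on the query side:

GET my_index/_search
{
  "query": {
    "term": { "pinyin": "liu" }
  }
}

Because each array element is stored as its own keyword term, the term query matches any one of your pre-computed tokens exactly, without Elasticsearch running another analyzer over them.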