I can handle/extract the text from my PDF files, but I don't quite know whether I'm going about storing the content in Elasticsearch the right way.
My PDF texts are mostly German, with letters like "ö", "ä", etc.
In order to store EVERY character of the content, I "escape" the necessary characters and encode them properly to JSON so I can store them.
For example:
I want to store the following (PDF) text:
Öffentliche Verkehrsmittel. TestPath: C:\Windows\explorer.exe
I convert and upload it to Elasticsearch like this:
{"text":"\\u00D6ffentliche Verkehrsmittel. TestPath: C:\\\\Windows\\\\explorer.exe"}
My question is: Is this the right way to store documents like this?
Elasticsearch comes with a wide range of built-in language-specific analyzers. If you create a text field and store your data, the standard analyzer is used by default; you can change that like below (here via a title.german sub-field that uses the german analyzer):
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "german": {
                        "type": "text",
                        "analyzer": "german"
                    }
                }
            }
        }
    }
}
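With that mapping in place you can index your text as plain UTF-8. A standard JSON library will escape only what JSON itself requires, so just the backslashes in the path need doubling; the \u00D6 form is not necessary. A minimal sketch, where my-index and document ID 1 are placeholders:

PUT my-index/_doc/1
{
    "title": "Öffentliche Verkehrsmittel. TestPath: C:\\Windows\\explorer.exe"
}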
You can also check the tokens generated by a language analyzer, in your case german, using the Analyze API:
POST _analyze
{
    "text": "Öffentliche",
    "analyzer": "german"
}
And the generated token (the german analyzer lowercases the text, folds Ö to o, and stems the word):
{
    "tokens": [
        {
            "token": "offentlich",
            "start_offset": 0,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 0
        }
    ]
}
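If the index already exists, you can also run the Analyze API against a mapped field instead of naming the analyzer explicitly; this sketch assumes the title.german sub-field from the mapping above and the placeholder index name:

GET my-index/_analyze
{
    "field": "title.german",
    "text": "Öffentliche"
}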
Tokens for Ö:
POST _analyze
{
    "text": "Ö",
    "analyzer": "german"
}
{
    "tokens": [
        {
            "token": "o",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<ALPHANUM>",
            "position": 0
        }
    ]
}
Note: the analyzer folded the character to a plain o, so whether you search for Ö or ö, the document will come up in the search results, as the same analyzer is applied at query time if you use a match query.
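For example, a match query like the following (again assuming the title.german sub-field and the placeholder index name) will find the document whether the query text contains Ö, ö, or o:

POST my-index/_search
{
    "query": {
        "match": {
            "title.german": "öffentliche"
        }
    }
}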