Simple Elasticsearch PDF Text Search using the German language


I can extract the text from my PDF files, but I don't quite know whether I am storing the content in Elasticsearch the right way.

My PDF texts are mostly German, with letters like "ö", "ä", etc.

To store EVERY character of the content, I escape the necessary characters and encode the text as JSON before storing it.

For example:

I want to store the following (PDF) text:

Öffentliche Verkehrsmittel. TestPath: C:\Windows\explorer.exe

I convert and upload it to Elasticsearch like this:

{"text":"\\u00D6ffentliche Verkehrsmittel. TestPath: C:\\\\Windows\\\\explorer.exe"}

My question is: Is this the right way to store documents like this?
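
For comparison, here is a minimal Python sketch of how a standard JSON encoder handles this text. It assumes the extracted PDF text is already available as a Unicode string; the variable names are illustrative only. The encoder adds the escapes itself, and decoding returns the original string unchanged, so the text should not need any manual pre-escaping.

```python
import json

# The text as extracted from the PDF (a raw string keeps the backslashes literal).
pdf_text = r"Öffentliche Verkehrsmittel. TestPath: C:\Windows\explorer.exe"

# Default settings escape non-ASCII characters and backslashes.
ascii_body = json.dumps({"text": pdf_text})

# ensure_ascii=False keeps "Ö" as a literal character; both forms are valid JSON.
utf8_body = json.dumps({"text": pdf_text}, ensure_ascii=False)

# Both bodies decode back to exactly the original text.
assert json.loads(ascii_body)["text"] == pdf_text
assert json.loads(utf8_body)["text"] == pdf_text

print(ascii_body)
print(utf8_body)
```

Either body can be sent as the document source, provided the request is transmitted as UTF-8 when the unescaped form is used.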


Solution

  • Elasticsearch ships with a wide range of built-in language-specific analyzers. If you create a text field without specifying one, the standard analyzer is used by default. You can change that in the mapping, like this:

    {
        "mappings": {
            "properties": {
                "title.german": {
                    "type": "text",
                    "analyzer": "german"
                }
            }
        }
    }
    

    You can also check the tokens generated by a language analyzer (german, in your case) with the analyze API, by sending this body to POST /_analyze:

    {
        "text" : "Öffentliche",
        "analyzer" : "german"
    }
    

    And the generated token:

    {
        "tokens": [
            {
                "token": "offentlich",
                "start_offset": 0,
                "end_offset": 11,
                "type": "<ALPHANUM>",
                "position": 0
            }
        ]
    }
    

    And the tokens for Ö alone:

    {
        "text" : "Ö",
        "analyzer" : "german"
    }
    
    {
        "tokens": [
            {
                "token": "o",
                "start_offset": 0,
                "end_offset": 1,
                "type": "<ALPHANUM>",
                "position": 0
            }
        ]
    }
    

    Note: the analyzer normalized Ö to a plain o, so a search for either Ö or ö will return the document, because the same analyzer is applied at query time when you use the match query.
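
The steps above can be sketched end to end in Python. This is only an illustration under stated assumptions: the index name `pdf-docs` is hypothetical, the field name `text` is taken from the document in the question, and no Elasticsearch client is used. The script only builds and prints the request bodies, which could then be sent with any HTTP client.

```python
import json

INDEX = "pdf-docs"  # hypothetical index name, not from the original post
FIELD = "text"      # field name used in the question's document

# Mapping that analyzes the field with the built-in "german" analyzer.
mapping = {
    "mappings": {
        "properties": {
            FIELD: {"type": "text", "analyzer": "german"}
        }
    }
}

# Document body; json.dumps takes care of all JSON escaping.
document = {FIELD: r"Öffentliche Verkehrsmittel. TestPath: C:\Windows\explorer.exe"}


def match_query(term):
    """Build a match query; the field's analyzer is applied to the term at query time."""
    return {"query": {"match": {FIELD: term}}}


# Requests in the order they would be sent: create index, index document, search.
requests_to_send = [
    ("PUT", f"/{INDEX}", mapping),
    ("POST", f"/{INDEX}/_doc", document),
    ("POST", f"/{INDEX}/_search", match_query("öffentliche")),
]

for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because the match query runs the search term through the same german analyzer, the lowercase query `öffentliche` should produce the same token as the indexed `Öffentliche`.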