
Tesseract does not recognize German "für"


I use Tesseract 4.0 via the Docker image tesseractshadow/tesseract4re.

I pass the option -l=deu to tell Tesseract that the text is in German ("Deutsch").

The result for the German word "für" is still poor, even though the word is very common (it means "for" in English).

Tesseract often outputs "fiir" or "fur" instead.

What can I do to improve this?

Reproducible example

docker run --name self.container_name --rm \
    --volume  $PWD:/pwd \
    tesseractshadow/tesseract4re \
    tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l=deu

Result:

cat die-fuer-das.png.ocr-result.txt 
die fur das

Image die-fuer-das.png: (image containing the text "die für das")


Solution

  • I found the solution. The option must be written as -l deu (with a space, not an equals sign); otherwise the German language data is not used. I had accidentally written -l=deu.

    Works:

    ===> tesseract  die-fuer-das.png out  -l deu; cat out.txt
    Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
    die für das
    

    Wrong language:

    ===> tesseract  die-fuer-das.png out  -l=deu; cat out.txt
    Tesseract Open Source OCR Engine v4.0.0-beta.1-262-g555f with Leptonica
    die fur das
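
A guard against this mistake can be sketched as follows, assuming a POSIX shell. The helper names check_lang_flag and safe_tesseract are hypothetical and not part of Tesseract; the Docker image and file names in the usage comment are taken from the question.

```shell
# Hypothetical helper (not part of Tesseract): reject the "-l=LANG" spelling,
# which, per the answer above, does not select the language.
check_lang_flag() {
    for arg in "$@"; do
        case "$arg" in
            -l=*)
                # Strip the "-l=" prefix to suggest the working form.
                echo "use '-l ${arg#-l=}' instead of '$arg'" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Hypothetical wrapper: validate the arguments, then pass them to tesseract unchanged.
safe_tesseract() {
    check_lang_flag "$@" && tesseract "$@"
}

# Usage with the corrected flag (image and file names from the question):
#   docker run --rm --volume "$PWD:/pwd" tesseractshadow/tesseract4re \
#       tesseract /pwd/die-fuer-das.png /pwd/die-fuer-das.png.ocr-result -l deu
```

To confirm that the German language data is installed at all, tesseract --list-langs prints the available languages; deu should appear in that list.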