
Why does tesseract ignore a whole digit when it reads the same digit next to it just fine?


This is a bit of a conundrum for me.

In the image below, the tesseract package in R completely ignores the second occurrence of 1 on the fourth line, no matter what I do (that is, it reads 1 instead of 11). The image here is already preprocessed: upscaled via nn, cleaned, and binarized. The result is the same even if I only lightly preprocess the source image.

Cropping the noise on the right does not help. Changing the tessedit_pageseg_mode option at best makes no difference to this particular problem, and at worst degrades the rest of the output.

Where the heck did the 1 go? I need to know for the sake of my sanity.



Solution

  • While waiting for R to compile the tesseract package, I tested the command-line version:

    $ tesseract --version
    tesseract 4.1.1
      leptonica-1.79.0 #...etc
    $ tesseract ocr_test.png  test
    obec TREBOHOSTICE 2021
    okres Strakonice, Jihocesky kraj
    
    Poéet osob starSich 15 let 274
    Poéet osob v exekuci 11
    Podil osob v exekuci 4,01 %
    Celkovy pocet exekuci 106
    Prumérny poéet exekuci na osobu 9.6
    Z toho:
    
    podil (pocet) osob s 1 — 9 exekucemi 45% (5)
    podil (pocet) osob s 10 a vice exekucemi 55% (6)
    
    PM. 2
    

    The CLI output looks good, so the problem might be related to the versions of the underlying libraries (such as leptonica) installed on your system.


    A clean compile of the R tesseract package, plus the Linux packages it builds against:

    # Linux command line: install the development libraries first
    $ sudo apt install libpoppler-cpp-dev libtesseract-dev libleptonica-dev
    
    # In R
    install.packages("tesseract")  # version 5.1.0
    library(tesseract)
    ocr(file.choose())  # pick the image file interactively
    

    The output for row 4 (11) looks good:

    obec TREBOHOSTICE 2021
    okres Strakonice, Jihocesky kraj
    
    Poéet osob starSich 15 let 274
    Poéet osob v exekuci 11
    Podil osob v exekuci 401% |
    Celkovy pocet exekuci 106
    Prumérny poéet exekuci na osobu 9.6
    Z toho: on
    podil (pocet) osob s 1 — 9 exekucemi 45% (5) ;
    podil (pocet) osob s 10 a vice exekucemi 55% (6) >
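
One way to check the version hypothesis above is to compare the version of the command-line binary with the version of the library the R package is linked against. The following is only a minimal sketch, assuming a POSIX shell and that the version line looks like `tesseract 4.1.1`, as in the log above. `tess_version` is a hypothetical helper name, and the script only reports versions; it does not change anything on the system.

```shell
#!/bin/sh
# Hypothetical helper: read `tesseract --version`-style output on stdin and
# print the version number from the first line that starts with "tesseract ".
tess_version() {
    sed -n 's/^tesseract \([^ ]*\).*/\1/p' | head -n 1
}

# Version of the command-line binary (left empty if tesseract is not on PATH).
cli_version=""
if command -v tesseract >/dev/null 2>&1; then
    # 2>&1 because some builds print the version banner on stderr.
    cli_version=$(tesseract --version 2>&1 | tess_version)
fi

# Version reported by the R package (left empty if R or the package is missing).
r_version=""
if command -v Rscript >/dev/null 2>&1; then
    r_version=$(Rscript -e 'cat(tesseract::tesseract_info()$version)' 2>/dev/null) || r_version=""
fi

echo "CLI tesseract:     ${cli_version:-not found}"
echo "R package reports: ${r_version:-not found}"
```

If the two lines disagree, the R package was probably built against a different tesseract than the one on the PATH, and a clean reinstall like the one above is worth trying.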