macos pdf jpeg ocr tesseract

Improve quality of image for tesseract OCR


With regard to this question and this question, where I asked how to download thousands of PDFs and process them to extract their text with OCR, I am hitting a brick wall again when it comes to improving the text output.

I am interested in extracting the text of a bunch of PDFs in order to search it for surnames (I do not necessarily need to be able to read the rest of the text). The PDFs are old newspaper articles, published between 1810 and 1832 and set in German Fraktur. This typeface seems to be particularly challenging for tesseract.
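A minimal sketch of the surname search described above, assuming a hypothetical word list `surnames.txt` with one surname per line and no empty lines. The function name `find_surnames` is made up for illustration. It only finds exact (case-insensitive) matches, so a name that the OCR has garbled will still be missed.

```shell
# Sketch: print each surname from a word list that occurs in an OCR text file.
# Both file names and the function name are hypothetical.
find_surnames() {
    names=$1    # file with one surname per line (no empty lines)
    text=$2     # text file produced by tesseract
    # -F: fixed strings, -i: ignore case, -w: whole words only,
    # -o: print only the matched part, -f: read the patterns from a file
    grep -F -i -w -o -f "$names" "$text" | sort -u
}
```

Called as `find_surnames surnames.txt example-page.txt`, it prints each distinct match on its own line.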

Q: How can I further improve the image quality for tesseract to at least have a chance of finding the surnames in the text? Which procedure would you suggest?

If we take this pdf as an example, I receive the following image when applying

convert -colorspace GRAY -resize 3000x -units PixelsPerInch example.pdf example-page.jpg

[image: example page after conversion with convert]

If I now use tesseract with

tesseract --tessdata-dir /usr/local/share/tessdata/ -l deu_frak example-page.jpg example-page.txt

it performs terribly on that image, with only roughly 360 diacritics detected. My text output is entirely scrambled.

When I use Fred's ImageMagick script textcleaner, applying either

textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10

or

textcleaner -g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20

I get something like this

[image: example page after textcleaner]

When I then run tesseract again with the above-mentioned command, the resulting text is much better (around 700-800 diacritics detected) but still too scrambled to find most of the surnames in the text.

I know that the example page is a particularly hard one; however, even pages that are not inky prints and not skewed to begin with yield mostly scrambled output and undecipherable surnames when processed with tesseract and the above command.

For example this page

[image: a cleaner example page]

Q: How can I further improve the image quality for tesseract to at least have a chance of finding the surnames in the text? Which procedure would you suggest?

Edit: I do not know whether training tesseract is needed, or even a good idea, for the given German Fraktur font, as no GUI box editor seems to work reliably on macOS (see, for example, jTessBoxEditor, Qt-box-editor, or Tesseract-Box-Editor), nor did I understand how to train tesseract (see the tesseract training wiki here and another tutorial here).


Solution

  • My father had a similar problem with his old newspaper clippings, and I had moderately good success by preprocessing with Ghostscript and then applying Tesseract. Your mileage may vary. My commands (Windows) were:

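    rem %1 = base name of the PDF, without the .pdf extension
    set nm=%1
    rem %2 is read here but not used below
    set d=%2
    rem Rasterize every page to a grayscale PGM at 150 dpi
    "C:\Program Files\gs\gs9.21\bin\gswin32.exe" -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r150 -dTextAlphaBits=4 -sOutputFile="%nm%-%%02d.pgm" %nm%.pdf
    rem Create an empty file for the combined text
    echo. 2>"%nm%.txt"
    
    for %%f in (%nm%*.pgm) do (
        echo %%~nf
        rem OCR one page; Tesseract appends .txt to the output base name
        "C:\Program Files\Tesseract-OCR\tesseract.exe" "%%~nf.pgm" "%%~nf"
        rem Append the page text (type is the built-in counterpart of cat)
        type "%%~nf.txt" >> "%nm%.txt"
        del  "%%~nf.pgm"
        del  "%%~nf.txt"
    )
    rem Open the combined text in Word
    "C:\Program Files\Microsoft Office\Office11\winword.exe" "%nm%.txt"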
    

    EDIT: response to comment

    First, install Ghostscript on your Mac. See https://wiki.scribus.net/canvas/Installation_and_Configuration_of_Ghostscript#Installing_Ghostscript_on_Mac_OS_X

    Then do

    gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r150 -dTextAlphaBits=4 -sOutputFile='paper-%02d.pgm' paper.pdf
    

    This will create rasterized files paper-01.pgm, paper-02.pgm, etc., one per page of the PDF. Replace "paper" with the base name of your original PDF. You can also adjust the resolution (the -r150 option). That and other options are described at https://ghostscript.com/doc/9.19/Use.htm

    Then use tesseract on each pgm file.
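
    A rough macOS/Linux counterpart of the Windows batch file above, as a sketch rather than a tested recipe. It assumes gs and tesseract are on the PATH and that the deu_frak language data from the question is installed. The function name ocr_pdf and its argument order are made up for illustration.

```shell
# Sketch: rasterize a PDF with Ghostscript, OCR each page with Tesseract,
# and collect the text in <base>.txt. ocr_pdf is a hypothetical helper name.
ocr_pdf() {
    base=$1                 # PDF base name without ".pdf", e.g. "paper"
    lang=${2:-deu_frak}     # Tesseract language (assumed to be installed)
    dpi=${3:-150}           # rasterization resolution, 150 as in the answer

    # One grayscale PGM per page: base-01.pgm, base-02.pgm, ...
    gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r"$dpi" \
       -dTextAlphaBits=4 -sOutputFile="$base-%02d.pgm" "$base.pdf" || return 1

    : > "$base.txt"         # start with an empty combined file
    for page in "$base"-*.pgm; do
        [ -e "$page" ] || continue      # no pages were produced
        # Tesseract appends ".txt" to the output base it is given
        tesseract "$page" "${page%.pgm}" -l "$lang" || return 1
        cat "${page%.pgm}.txt" >> "$base.txt"
        rm -f "$page" "${page%.pgm}.txt"
    done
}
```

    Calling `ocr_pdf paper` should leave the combined text in paper.txt. The two-digit page counter keeps the pages in order only up to 99 pages. A higher resolution can be tried with, for example, `ocr_pdf paper deu_frak 300`; whether that helps on these scans is something to check by experiment.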