With regard to this question and this question, where I ask how to download thousands of PDFs and process them to extract their text with OCR, I am hitting a brick wall again when it comes to enhancing the text output.
I am interested in extracting the text of a bunch of PDFs in order to search for surnames in the text (I do not necessarily need to be able to read the rest of the text). The PDFs represent old newspaper articles, published between 1810 and 1832 and written in German Fraktur. This font seems to be particularly challenging for tesseract.
Q: How can I further improve the image quality for tesseract to - at least - have a chance to find the surnames in the text? Which procedure would you suggest?
If we take this PDF as an example, I receive the following image when applying
convert -colorspace GRAY -resize 3000x -units PixelsPerInch example.pdf example-page.jpg
If I now use tesseract with
tesseract --tessdata-dir /usr/local/share/tessdata/ -l deu_frak example-page.jpg example-page.txt
it performs terribly on that image, with only roughly 360 diacritics detected. My text output is entirely scrambled.
When I use Fred's ImageMagick script textcleaner, applying either
textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10
or
textcleaner -g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20
I get something like this
When I then run tesseract again with the above-mentioned command, the resulting text is much better (around 700-800 diacritics detected) but still scrambled enough that most surnames in the text cannot be found.
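For reference, a minimal sketch of the whole pipeline, looped over a directory of single-page PDFs, would look roughly like this; surnames.txt is a hypothetical file with one surname per line, and the textcleaner options are the second variant from above:
#!/usr/bin/env bash
# Sketch only: rasterize, clean, OCR, then grep the OCR output for surnames.
# Assumes single-page PDFs (as in the example) and that surnames.txt exists.
for pdf in *.pdf; do
    base="${pdf%.pdf}"
    convert -colorspace GRAY -resize 3000x -units PixelsPerInch "$pdf" "$base.jpg"
    textcleaner -g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20 "$base.jpg" "$base-clean.jpg"
    tesseract --tessdata-dir /usr/local/share/tessdata/ -l deu_frak "$base-clean.jpg" "$base"
    # Print any case-insensitive hits from the surname list, with the file name
    grep -i -H -f surnames.txt "$base.txt"
done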
I know that the example page is a particularly hard one; however, even pages that are not inky prints and not skewed to begin with yield mostly scrambled output and undecipherable surnames when processed with tesseract and the above command.
For example, this page:
Q: How can I further improve the image quality for tesseract to - at least - have a chance to find the surnames in the text? Which procedure would you suggest?
Edit: I do not know whether training tesseract is needed or a good idea to deal with the given German Fraktur font, as no GUI box editor seems to work reliably on macOS (see, for example, jTessBoxEditor, Qt-box-editor, or Tesseract-Box-Editor), nor did I understand how to train tesseract (see the tesseract training wiki here and another tutorial here).
My father had a similar problem with his old newspaper clippings, and I had moderately good success by preprocessing with GhostScript and then applying Tesseract. Your mileage may vary. My commands (Windows) were
set nm=%1
rem %2 is set but not used below
set d=%2
rem Rasterize each page of %nm%.pdf to a 150 dpi PGM image with Ghostscript
"C:\Program Files\gs\gs9.21\bin\gswin32.exe" -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r150 -dTextAlphaBits=4 -sOutputFile="%nm%-%%00d.pgm" %nm%.pdf
rem Create an empty combined output file
echo. 2>"%nm%.txt"
rem OCR each page image and append its text to the combined file
for %%f in (%nm%*.pgm) do (
    echo %%~nf
    "C:\Program Files\Tesseract-OCR\tesseract.exe" "%%~nf.pgm" "%%~nf"
    rem "cat" assumes a Unix-style cat on the PATH (e.g. GnuWin32/Cygwin); "type" works natively
    cat "%%~nf.txt" >> "%nm%.txt"
    del "%%~nf.pgm"
    del "%%~nf.txt"
)
rem Open the combined text in Word for review
"C:\Program Files\Microsoft Office\Office11\winword.exe" "%nm%.txt"
EDIT: response to comment
First, install Ghostscript on your Mac. See https://wiki.scribus.net/canvas/Installation_and_Configuration_of_Ghostscript#Installing_Ghostscript_on_Mac_OS_X
Then do
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r150 -dTextAlphaBits=4 -sOutputFile='paper-%00d.pgm' paper.pdf
This will create rasterized files paper-01.pgm, paper-02.pgm, etc. (in case your PDF has multiple pages). You can replace "paper" with the basename of your original PDF. You can also mess with the resolution. That and other options can be found at https://ghostscript.com/doc/9.19/Use.htm
Then use tesseract on each pgm file.
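For example, a rough bash equivalent of the loop in the batch file above, assuming the deu_frak language data from the question is installed, would be:
for f in paper-*.pgm; do
    # OCR each rasterized page; output goes to paper-1.txt, paper-2.txt, ...
    tesseract --tessdata-dir /usr/local/share/tessdata/ -l deu_frak "$f" "${f%.pgm}"
done
# Combine the per-page text files into one
cat paper-*.txt > paper.txt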