With regard to this question and this question, where I ask how to download thousands of PDFs and process them to extract their text with OCR, I am hitting a brick wall again when it comes to enhancing the text output.
I am interested in extracting the text of a bunch of PDFs in order to search for surnames in the text (I do not necessarily need to be able to read the rest of the text). The PDFs represent old newspaper articles, published between 1810 and 1832 and written in German Fraktur. This font seems to be particularly challenging for tesseract.
Q: How can I further improve the image quality for tesseract to - at least - have a chance to find the surnames in the text? Which procedure would you suggest?
If we take this PDF as an example, I receive the following image when applying
convert -colorspace GRAY -resize 3000x -units PixelsPerInch example.pdf example-page.jpg
If I now use tesseract with
tesseract --tessdata-dir /usr/local/share/tessdata/ -l deu_frak example-page.jpg example-page.txt
it performs terribly on that image, with only roughly 360 diacritics detected. My text output is entirely scrambled.
When I use Fred's ImageMagick script textcleaner, applying either
textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10
or
textcleaner -g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20
I get something like this
When I then run tesseract again with the above-mentioned command, the resulting text is much better (around 700-800 diacritics detected) but still scrambled enough that most surnames in the text cannot be found.
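For reference, a minimal sketch of the whole pipeline, looped over a directory of single-page PDFs, would look roughly like this; surnames.txt is a hypothetical file with one surname per line, and the textcleaner options are the second variant from above:
#!/usr/bin/env bash
# Sketch only: rasterize, clean, OCR, then grep the OCR output for surnames.
# Assumes single-page PDFs (as in the example) and that surnames.txt exists.
for pdf in *.pdf; do
    base="${pdf%.pdf}"
    convert -colorspace GRAY -resize 3000x -units PixelsPerInch "$pdf" "$base.jpg"
    textcleaner -g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20 "$base.jpg" "$base-clean.jpg"
    tesseract --tessdata-dir /usr/local/share/tessdata/ -l deu_frak "$base-clean.jpg" "$base"
    # Print any case-insensitive hits from the surname list, with the file name
    grep -i -H -f surnames.txt "$base.txt"
done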
I know that the example page is a particularly hard one; however, even pages that are not inky prints and not skewed to begin with yield mostly scrambled output and undecipherable surnames when processed with tesseract and the above command.
For example, this page:
Q: How can I further improve the image quality for tesseract to - at least - have a chance to find the surnames in the text? Which procedure would you suggest?
Edit: I do not know whether training tesseract is needed or a good idea to deal with the given German Fraktur font, as no GUI box editor seems to work reliably on macOS (see, for example, jTessBoxEditor, Qt-box-editor, or Tesseract-Box-Editor), nor did I understand how to train tesseract (see the tesseract training wiki here and another tutorial here).
My father had a similar problem with his old newspaper clippings, and I had moderately good success by preprocessing with GhostScript and then applying Tesseract. Your mileage may vary. My commands (Windows) were
set nm=%1
rem %2 is set but not used below
set d=%2
rem Rasterize each page of %nm%.pdf to a 150 dpi PGM image with Ghostscript
"C:\Program Files\gs\gs9.21\bin\gswin32.exe" -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r150 -dTextAlphaBits=4 -sOutputFile="%nm%-%%00d.pgm" %nm%.pdf
rem Create an empty combined output file
echo. 2>"%nm%.txt"
rem OCR each page image and append its text to the combined file
for %%f in (%nm%*.pgm) do (
    echo %%~nf
    "C:\Program Files\Tesseract-OCR\tesseract.exe" "%%~nf.pgm" "%%~nf"
    rem "cat" assumes a Unix-style cat on the PATH (e.g. GnuWin32/Cygwin); "type" works natively
    cat "%%~nf.txt" >> "%nm%.txt"
    del "%%~nf.pgm"
    del "%%~nf.txt"
)
rem Open the combined text in Word for review
"C:\Program Files\Microsoft Office\Office11\winword.exe" "%nm%.txt"
EDIT: response to comment
First, install Ghostscript on your Mac. See https://wiki.scribus.net/canvas/Installation_and_Configuration_of_Ghostscript#Installing_Ghostscript_on_Mac_OS_X
Then do
gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pgmraw -r150 -dTextAlphaBits=4 -sOutputFile='paper-%00d.pgm' paper.pdf
This will create rasterized files paper-01.pgm, paper-02.pgm, etc. (in case your PDF has multiple pages). You can replace "paper" with the basename of your original PDF. You can also mess with the resolution. That and other options can be found at https://ghostscript.com/doc/9.19/Use.htm
Then use tesseract on each pgm file.
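For example, a rough bash equivalent of the loop in the batch file above, assuming the deu_frak language data from the question is installed, would be:
for f in paper-*.pgm; do
    # OCR each rasterized page; output goes to paper-1.txt, paper-2.txt, ...
    tesseract --tessdata-dir /usr/local/share/tessdata/ -l deu_frak "$f" "${f%.pgm}"
done
# Combine the per-page text files into one
cat paper-*.txt > paper.txt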