Tags: macos, pdf, imagemagick, ocr, tesseract

Batch OCR of 5800+ PDFs written in German Fraktur


I would like to batch-OCR about 5800 PDFs (each between 2 and 6 pages, from my last question here) with open-source command-line tools on a Mac. The main purpose of this adventure is to retrieve, as reliably as I can, names (surnames most importantly) from the text of all these PDFs. Here is an example of what an issue looks like.

At this point, I do not know exactly how to proceed. What would you do?

I had in mind to first convert each multipage PDF into single-page images (png, jpg, or tif) and move all images belonging to one PDF into a folder of its own, with the following command:

time for i in *.pdf; do mkdir "${i%.pdf}"; convert -colorspace GRAY -resize 3000x -units PixelsPerInch "$i" "${i%.pdf}.jpg"; mv "${i%.pdf}"*.jpg "${i%.pdf}"; done 

As a second step, my OCR script would need to enter each folder, do its magic, and leave again in order to proceed with the next; I do not know how to write this. The core of the script would be:

tesseract --tessdata-dir /usr/local/share/tessdata/ --oem 3 --psm 11 -l deu_frak *.jpg test.txt 
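One way to write that wrapper could be the following (a sketch, untested on your data; it assumes the per-PDF folders created by the convert loop above, and note that tesseract appends .txt to the output base name itself, so the `test.txt` argument above would actually produce `test.txt.txt`):

```shell
#!/bin/bash
# Sketch: visit each per-PDF folder, OCR every page image inside it,
# then concatenate the per-page results into one text file per PDF.
# Assumes the folders were created by the convert loop above.
for dir in */; do
  (
    cd "$dir" || exit 1
    for img in *.jpg; do
      # tesseract appends ".txt" itself, so pass only the base name
      tesseract --tessdata-dir /usr/local/share/tessdata/ \
                --oem 3 --psm 11 -l deu_frak \
                "$img" "${img%.jpg}"
    done
    # collect all pages of this PDF into one file in the parent folder
    cat *.txt > "../${dir%/}.txt"
  )
done
```

Running each folder in a subshell (the parentheses) means the `cd` never leaks into the next iteration.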

As the PDFs are old newspaper articles, published almost daily between 1810 and 1832, they are set in German Fraktur. This typeface seems to be particularly challenging for tesseract. My text output is normally scrambled, e.g. for the first page of the article linked above I get only between 791 and 801 diacritics detected. Names are at risk of not being identified as such, depending on the chosen options.

At the end, I would use ripgrep to look for names within all 5800 txt files I hope to obtain:

time rg -i search_term_here

Finally, how can I make sure that I get the best possible OCR output, so that I capture most of the (sur)names in the texts?

P.S.: When, by the way, will tesseract 4 be available for the Mac, with German Fraktur training data?

Edit:

These are the commands I have used to achieve what I wanted, although the tesseract output could still be improved a great deal.

Convert each PDF into jpg and move them to respective folders to keep order:

time parallel -j 8 'mkdir {.} && convert {} -colorspace GRAY -resize 3000x -units PixelsPerInch {.}/{.}.jpg' ::: *.pdf

Using Fred's ImageMagick script textcleaner (which I have moved to /usr/local/bin/ for convenience) to improve tesseract's results a bit:

time find . -name \*.jpg | parallel textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 {} {}

Parallelising the tesseract analyses:

time find . -name \*.jpg | parallel -j 8 "tesseract {} {.} --tessdata-dir /usr/local/share/tessdata/ -l deu_frak"

Search for the surnames with ripgrep:

time rg -t txt -i term
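If you have a whole list of surnames to check, a small loop can report which files mention each one (a sketch; `names.txt` is a hypothetical file with one surname per line):

```shell
# For each surname in names.txt (hypothetical file, one name per line),
# list the text files that mention it, case-insensitively.
while IFS= read -r name; do
    printf '== %s ==\n' "$name"
    rg -t txt -i -l "$name" . || true   # rg exits non-zero on no match
done < names.txt
```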

Solution

  • First, I would recommend you install Homebrew if you have not already; it is an excellent package manager for the Mac.

    Then I would recommend you install the Poppler package to get the pdfimages tool:

    brew install poppler
    

    You can then extract images from a PDF like this:

    pdfimages SomeFile.pdf root
    

    and you will get files named root-000.ppm and root-001.ppm which will work fine with tesseract. Or you can add -png if you want PNG images. I would avoid JPEG because of lossy compression.
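    Recent Poppler versions can also list the embedded images first, which helps you decide on an output format before extracting anything:

```shell
# Show width, height, colour space, encoding and DPI of every
# embedded image object, without extracting anything:
pdfimages -list SomeFile.pdf

# Then extract losslessly, e.g. as PNG:
pdfimages -png SomeFile.pdf root    # root-000.png, root-001.png, ...
```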

    If you can get that working, I would then suggest you install GNU Parallel with:

    brew install parallel
    

    and we can work on doing OCR in parallel down the line.


    PLEASE TRY THE FOLLOWING ONLY IN A SMALL DIRECTORY WITH 5-6 COPIES OF YOUR ORIGINALS

    We can also extract the images in parallel using GNU Parallel like this:

    parallel 'mkdir {.} && pdfimages {} {.}/{.}' ::: *pdf
    

    As regards using Fred's textcleaner with GNU Parallel, and wanting to overwrite the JPEGs, I think you will want something like this:

    find . -name \*.jpg | parallel textcleaner -g -e stretch -f 25 -o 10 -u -s 1 -T -p 10 {} {}
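    Once the images are extracted and cleaned, the OCR pass itself can be parallelised the same way (a sketch; it assumes the per-PDF folders from the pdfimages step above and that tesseract's deu_frak data is installed):

```shell
# OCR every extracted page image in parallel, 8 jobs at a time.
# pdfimages writes .ppm by default (.png with -png); tesseract reads both.
# {.} strips the extension, and tesseract appends ".txt" by itself.
find . -name '*.ppm' | parallel -j 8 "tesseract {} {.} -l deu_frak"
```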