linux python-3.x ocr tesseract

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files


I’m using tesseract to batch convert a list of images to both a searchable PDF and a TXT file containing the OCR’d text.

tesseract infile outfile -l eng myconfig
  • infile contains a list of image paths to process
  • myconfig contains tesseract preferences to specify the output types (tessedit_create_txt 1 and tessedit_create_pdf 1)
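
For reference, here is a minimal Python sketch of how the two input files and the command line above could be put together. The function name prepare_batch and its parameters are made up for illustration. It assumes the list file holds one image path per line, and it uses the variable name tessedit_create_txt, which is the name used in the txt config that ships with Tesseract. It only builds the command and does not run it.

```python
from pathlib import Path


def prepare_batch(image_paths, work_dir, out_base="outfile", lang="eng"):
    """Write the image list and config file, then return the tesseract command.

    prepare_batch is a hypothetical helper; the file names mirror the
    infile / outfile / myconfig names used in the question.
    """
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)

    # Assumption: one image path per line, in the order pages should appear.
    list_file = work / "infile"
    list_file.write_text("".join(f"{p}\n" for p in image_paths), encoding="utf-8")

    # Ask for both the merged text file and the searchable PDF in one run.
    config_file = work / "myconfig"
    config_file.write_text(
        "tessedit_create_txt 1\ntessedit_create_pdf 1\n", encoding="utf-8"
    )

    return [
        "tesseract",
        str(list_file),
        str(work / out_base),
        "-l",
        lang,
        str(config_file),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where tesseract is installed.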

This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.

What I’m really looking to do, however, is to output one TXT file per image, each named after its source image. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...

Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.

I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...

I’m working with both Python 3 and Linux commands at my disposal.


Solution

  • Since Tesseract doesn't seem to handle this natively, I wrote a function that splits the merged TXT file on the page separator into multiple text files. That said, from my observations, I'm not sure Tesseract runs any faster when it converts a batch of images to both PDF and TXT in a single pass than when it is run twice (once for PDF, once for TXT).
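
The function itself isn't posted above, so what follows is only a rough sketch of that kind of split, not the original code. The name split_ocr_text and its parameters are hypothetical. It assumes the page separator is a form feed (the default of Tesseract's page_separator setting, as far as I know), that the list file has one image path per line, and that pages come out in the same order as the list. Because it pairs pages with image names directly, no intermediate 0001.txt style files or renaming step are needed.

```python
from pathlib import Path


def split_ocr_text(merged_txt, image_list, out_dir=None, separator="\f"):
    """Write one '<image name>.txt' file per page of the merged OCR text.

    Hypothetical helper: assumes pages in merged_txt appear in the same
    order as the paths in image_list, delimited by `separator`.
    """
    lines = Path(image_list).read_text(encoding="utf-8").splitlines()
    images = [line.strip() for line in lines if line.strip()]

    pages = Path(merged_txt).read_text(encoding="utf-8").split(separator)

    # If the file ends with a separator, the split leaves one empty
    # trailing piece that does not belong to any image.
    if len(pages) == len(images) + 1 and not pages[-1].strip():
        pages.pop()

    # Refuse to guess when pages and images do not line up.
    if len(pages) != len(images):
        raise ValueError(
            f"found {len(pages)} pages but {len(images)} image paths"
        )

    written = []
    for image, text in zip(images, pages):
        image_path = Path(image)
        # Default to writing next to the image, e.g. Image1.jpg.txt.
        target_dir = Path(out_dir) if out_dir is not None else image_path.parent
        target = target_dir / (image_path.name + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Calling split_ocr_text("outfile.txt", "infile") after the tesseract run would then produce Image1.jpg.txt, Image2.jpg.txt and so on next to the images, while outfile.pdf stays as the merged PDF.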