Tesseract multiple output formats


My context

I'm using tesseract to extract text from an image.

I'm generating a .tsv so I can retrieve the extracted text and run some regexes on it, and a .pdf so I get a searchable PDF.

I do this by calling tesseract twice:

  • One asking for the .tsv
  • One asking for the .pdf

But I feel this is not very efficient, since the same computations must be done twice.

What I wish

I want my processing to go faster. My idea is to call tesseract only once while specifying two output formats.

Is it possible? If so, how?


Solution

  • Yes. List both output formats after the output base name; a single run then writes both out.pdf and out.tsv:

    tesseract yourimage.tif out pdf tsv
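
If the call is made from a script, the single run can be wrapped and the TSV read back for the regex step. The following is only a minimal sketch in Python, assuming the tesseract binary is on the PATH and that the TSV has the usual header row with a "text" column; the helper names build_command, words_from_tsv and run_ocr are made up for this illustration and are not part of Tesseract.

```python
import csv
import io
import subprocess


def build_command(image_path, output_base, formats=("pdf", "tsv")):
    """Return the argument list for one tesseract run that emits every format."""
    return ["tesseract", image_path, output_base, *formats]


def words_from_tsv(tsv_text):
    """Return the non-empty values of the 'text' column of a Tesseract TSV."""
    # QUOTE_NONE keeps quote characters in recognized words from confusing the parser.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row["text"] for row in reader if (row.get("text") or "").strip()]


def run_ocr(image_path, output_base):
    """Run tesseract once (writing .pdf and .tsv), then return the recognized words."""
    subprocess.run(build_command(image_path, output_base), check=True)
    with open(output_base + ".tsv", encoding="utf-8", newline="") as handle:
        return words_from_tsv(handle.read())
```

Calling run_ocr("yourimage.tif", "out") would leave out.pdf on disk and return the word list, which can then be joined and passed to whatever regex is needed.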