
Running tesseract 4.1 with openjpeg2 - cannot produce pdf output


I have installed on my RedHat machine:

(py36_maw) [rvp@lib-archcoll box]$ tesseract -v
tesseract 4.1.0
 leptonica-1.78.0
  libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libopenjp2 2.3.1
 Found SSE

Following what docs I could find, I try to produce PDF output with:

(py36_maw) [rvp@lib-archcoll box]$ time tesseract test.jp2 out -l eng PDF
read_params_file: Can't open PDF
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 275

That takes 10 seconds and produces file out.txt with fine OCR to text conversion evident.

However, it tries to read a file called PDF, and I cannot figure out how to get PDF output.

I have read various docs. The most promising advice seems to be to edit the config file, but the only docs I can find that look relevant (by googling 'tesseract 4.1 config') list many config variable names for older versions of tesseract, and none of them seems to indicate that I can request PDF output, much less specifically for tesseract 4.1.

How can I invoke tesseract 4.1 (using libopenjp2 2.3.1) via CLI to produce pdf output from my jp2 input file? Bonus question: how can I get it to produce both txt and pdf output in one run?

Robert


Solution

  • After more surfing and digging, here are the steps that worked for me. I assume the reader has done some digging too and knows what tesseract uses TESSDATA_PREFIX for:

    1. Download the pdf.ttf file from: https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/pdf.ttf
    2. Copy pdf.ttf into the directory that $TESSDATA_PREFIX points to, and make sure that variable is exported in your shell.
    3. TIP: Run tesseract --print-parameters to discover the variable names you can use in your own config file.
    4. Go to the dir with your test.jp2 file and create a file named config with these lines:
    tessedit_create_pdf     1       Write .pdf output file
    tessedit_create_txt     1       Write .txt output file
    

    (Note: you may also be able to put the config file in the TESSDATA_PREFIX directory and let it always be the default. Not tested.)

    5. Run in that dir:

    $ tesseract test.jp2 outputbase -l eng config

    6. Verify your success: the command runs and produces the files outputbase.txt and outputbase.pdf. The txt file looks good, and the searchable pdf looks and works OK in a pdf viewer; that is, you can search for text strings and find them.
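    The config and run steps above can be sketched as one small script. This is only a sketch under assumptions: pdf.ttf is already in $TESSDATA_PREFIX, the input is test.jp2 in the current directory, and the file names config and outputbase are the ones used above. The config lines carry only the variable name and value; the description column is left out.

```shell
#!/bin/sh
# Sketch of the config and run steps. Assumptions (not verified here):
# pdf.ttf is already in $TESSDATA_PREFIX and test.jp2 is in the current directory.

input="${1:-test.jp2}"       # input image; default matches the question
outbase="${2:-outputbase}"   # output base name; tesseract appends .txt / .pdf

# Write the config file with the two output switches.
cat > config <<'EOF'
tessedit_create_pdf 1
tessedit_create_txt 1
EOF

# Warn if the PDF font is not where tesseract will look for it.
if [ ! -f "${TESSDATA_PREFIX:-}/pdf.ttf" ]; then
  echo "warning: pdf.ttf not found in TESSDATA_PREFIX" >&2
fi

# Run OCR only when tesseract and the input file are both available.
if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
  tesseract "$input" "$outbase" -l eng config
  # Listing both files fails loudly if either output is missing.
  ls -l "$outbase.txt" "$outbase.pdf"
else
  echo "skipping OCR: tesseract or $input not available" >&2
fi
```

    Run it from the dir that holds test.jp2; it overwrites any existing file named config there.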

    Hope this helps someone else!
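    A possible shortcut, which I did not test on the setup above: the tesseract source tree ships ready-made config files named txt and pdf (lowercase) under tessdata/configs, and config names are case-sensitive file names, which would explain the "Can't open PDF" message in the question. The sketch below assumes those stock files were installed under $TESSDATA_PREFIX/configs; the helper name missing_configs is made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumption: the stock config files "txt" and "pdf" were
# installed under "$TESSDATA_PREFIX/configs" (they may be absent on some installs).

# Hypothetical helper: print the names of config files missing from a tessdata dir.
missing_configs() {
  dir="$1"
  shift
  for cfg in "$@"; do
    [ -f "$dir/configs/$cfg" ] || printf '%s\n' "$cfg"
  done
}

missing="$(missing_configs "${TESSDATA_PREFIX:-}" txt pdf)"
if [ -n "$missing" ]; then
  echo "stock config(s) not found: $missing" >&2
elif command -v tesseract >/dev/null 2>&1 && [ -f test.jp2 ]; then
  # Several config names can be given; each one switches on one output type.
  tesseract test.jp2 outputbase -l eng txt pdf
fi
```

    If the stock files are missing, the hand-written config file from the steps above does the same job.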