Tess4J doOCR() for *First Page* of pdf / tif


Is there a way to tell Tess4J to OCR only a certain number of pages or characters?

I will potentially be working with PDFs of 200+ pages, but I really only want to OCR the first page, if that!

As far as I understand, the common sample:

package net.sourceforge.tess4j.example;

import java.io.File;
import net.sourceforge.tess4j.*;

public class TesseractExample {

    public static void main(String[] args) {
        File imageFile = new File("eurotext.tif");
        Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
        // Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping

        try {
            // Runs OCR over the whole file and returns all recognized text
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

would attempt to OCR the entire 200+ page document into a single String.

For my particular case, that is far more than I need, and I'm worried it could take a very long time if I let it process all 200+ pages and then just take the first 500 or so characters of the result.


Solution

  • The library has a PdfUtilities class that can extract certain pages of a PDF.
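
A minimal sketch of that idea: copy only page 1 into a temporary PDF, then OCR that small file instead of the full document. To keep the sketch compilable without Tess4J, the split and OCR steps are passed in as two hypothetical hooks (PageSplitter and OcrEngine, names invented here). The wiring in the trailing comment assumes a recent Tess4J release in which PdfUtilities.splitPdf takes (File, File, int, int) with 1-based page numbers; older releases may use a different signature, so check the version you have.

```java
import java.io.File;

public class FirstPageOcr {

    /** Hypothetical hook: copies pages first..last (1-based) of a PDF into a new file. */
    public interface PageSplitter {
        void split(File in, File out, int first, int last) throws Exception;
    }

    /** Hypothetical hook: runs OCR on a file and returns the recognized text. */
    public interface OcrEngine {
        String doOCR(File file) throws Exception;
    }

    /** Extracts page 1 of the given PDF into a temp file and OCRs only that file. */
    public static String ocrFirstPage(File pdf, PageSplitter splitter, OcrEngine engine) {
        File firstPage = null;
        try {
            firstPage = File.createTempFile("first-page-", ".pdf");
            splitter.split(pdf, firstPage, 1, 1);   // keep page 1 only
            return engine.doOCR(firstPage);         // OCR the single-page file
        } catch (Exception e) {
            throw new IllegalStateException("Could not OCR first page of " + pdf, e);
        } finally {
            if (firstPage != null) {
                firstPage.delete();                 // clean up the temporary PDF
            }
        }
    }

    // Assumed wiring with Tess4J on the classpath (check signatures for your version):
    //   ITesseract tesseract = new Tesseract();
    //   String text = ocrFirstPage(new File("big.pdf"),
    //                              PdfUtilities::splitPdf,
    //                              tesseract::doOCR);
}
```

Because only the extracted page is handed to the OCR engine, the remaining pages are never rendered or recognized, which is where most of the time would otherwise go.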