Tags: java, parsing, pdf, tesseract

Make Tess4J get image from PDF file


How to make Tess4J get image from PDF file?

I've started converting image files to text using OCR (Tess4J). It works fine: I have tested it on an image and the result is great.

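import java.io.File;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

File imageFile = new File("D:\\HEAD2.png");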
Tesseract instance = Tesseract.getInstance();  // JNA Interface Mapping
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping

try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}

But now I'm facing a problem: I need to parse a PDF file that contains images. I don't know how to do this, and I have not found any example of using Tess4J with a PDF.

I tested the following example with Asprise, but I can't find an equivalent example for Tess4J:

import java.awt.image.BufferedImage;
import java.io.File;

import com.asprise.util.pdf.PDFReader;
import com.asprise.util.ocr.OCR;

PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file. 
int pages = reader.getNumberOfPages();

for(int i=0; i < pages; i++) {
   BufferedImage img = reader.getPageAsImage(i);

   // recognizes both characters and barcodes
   String text = new OCR().recognizeAll(img);
   System.out.println("Page " + i + ": " + text); 
}

reader.close(); // finally, close the file.

Solution

  • Use PdfUtilities.convertPdf2Png to convert the PDF pages to PNG image files, then call doOCR on each image as you did before.
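
The approach above can be sketched roughly as follows. This is only a minimal outline, not Tess4J's own code: the class PdfOcrSketch and the Step interface are hypothetical names invented for illustration, and the two steps are passed in as parameters so the loop itself does not depend on any particular Tess4J version. The comment at the end shows how the steps would presumably be wired to PdfUtilities.convertPdf2Png and doOCR, assuming Tess4J is on the classpath.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: OCRs a PDF page by page using two pluggable steps. */
final class PdfOcrSketch {

    /** A conversion step that may throw a checked exception. */
    @FunctionalInterface
    interface Step<A, B> {
        B apply(A input) throws Exception;
    }

    /**
     * Converts the PDF to one image file per page, then OCRs each image.
     * Returns the recognized text of each page, in page order.
     * Checked exceptions from either step are rethrown unchecked.
     */
    static List<String> ocrPdf(File pdf,
                               Step<File, File[]> pdfToImages,
                               Step<File, String> ocr) {
        List<String> pages = new ArrayList<>();
        try {
            File[] images = pdfToImages.apply(pdf);
            for (File image : images) {
                pages.add(ocr.apply(image));
            }
        } catch (Exception e) {
            throw new IllegalStateException("OCR failed for " + pdf, e);
        }
        return pages;
    }
}

// Assumed wiring with Tess4J (not compiled here; check your version's API):
//   Tesseract instance = Tesseract.getInstance();
//   List<String> pages = PdfOcrSketch.ocrPdf(
//           new File("my.pdf"),
//           PdfUtilities::convertPdf2Png,   // PDF -> File[] of PNG pages
//           instance::doOCR);               // PNG file -> recognized text
```

Note that convertPdf2Png writes the page images to disk, so you may want to delete those files once the text has been extracted.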