How to make Tess4J get image from PDF file?
I'm sarted on the transformation image file to text using OCR (Tess4J). It works fine, I have tested on image and it is great.
File imageFile = new File("D:\\HEAD2.png");
Tesseract instance = Tesseract.getInstance(); // JNA Interface Mapping
// Tesseract1 instance = new Tesseract1(); // JNA Direct Mapping
try {
String result = instance.doOCR(imageFile);
System.out.println(result);
} catch (TesseractException e) {
System.err.println(e.getMessage());
}
But I'm facing this problem. I would parse a pdf file that contains image so. I don't kow how to do And I have not found any exemple Tess4J with pdf
I tested this example with Asprise, but I don't find any example like this on Tess4J
import com.asprise.util.pdf.PDFReader;
import com.asprise.util.ocr.OCR;
PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file.
int pages = reader.getNumberOfPages();
for(int i=0; i < pages; i++) {
BufferedImage img = reader.getPageAsImage(i);
// recognizes both characters and barcodes
String text = new OCR().recognizeAll(image);
System.out.println("Page " + i + ": " + text);
}
reader.close(); // finally, close the file.
make use of pdfutilities.convertpdf2png and use it like you did before with images.