OCR of PDF files with images

I’ve got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice. Is there a way to avoid this, even if it has to make two passes: one for the straight text and then another for just the images?

Solution

There are two important flags that Tika uses when extracting text from PDFs:

X-Tika-PDFextractInlineImages (true/false). When false, all images are ignored. This works fine for native PDFs, because the text is extracted directly from the PDF. When true, the embedded images are also used for text extraction.
X-Tika-PDFocrStrategy (see https://tika.apache.org/1.24/api/org/apache/tika/parser/pdf/PDFParserConfig.OCR_STRATEGY.html). It takes one of three values:
NO_OCR: extract the text without OCR. Works for native PDFs.
OCR_ONLY: only OCR is used, so the text of a native PDF is also sent to OCR.
OCR_AND_TEXT_EXTRACTION: runs both NO_OCR and OCR_ONLY.

So when you have a fully native PDF, the combination X-Tika-PDFextractInlineImages: false, X-Tika-PDFocrStrategy: NO_OCR seems to be the best.

For fully scanned PDFs you can use X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY.
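As a rough sketch, the two combinations could be sent to a Tika server like this. It assumes a tika-server instance reachable at http://localhost:9998 and uses its /tika endpoint with PUT; the URL and the helper names tika_headers, build_request and extract_text are made up for illustration, so adjust them to your setup.

```python
import urllib.request

# Assumption: a tika-server instance is listening here; change as needed.
TIKA_URL = "http://localhost:9998/tika"


def tika_headers(kind):
    """Return the header combination for a 'native' or 'scanned' PDF."""
    if kind == "native":
        # Text layer only: ignore inline images and skip OCR.
        return {
            "X-Tika-PDFextractInlineImages": "false",
            "X-Tika-PDFocrStrategy": "NO_OCR",
        }
    if kind == "scanned":
        # No usable text layer: rely on OCR.
        return {
            "X-Tika-PDFextractInlineImages": "true",
            "X-Tika-PDFocrStrategy": "OCR_ONLY",
        }
    raise ValueError("kind must be 'native' or 'scanned'")


def build_request(pdf_bytes, kind, url=TIKA_URL):
    """Build (but do not send) the PUT request carrying the PDF."""
    headers = dict(tika_headers(kind))
    headers["Accept"] = "text/plain"
    headers["Content-Type"] = "application/pdf"
    return urllib.request.Request(
        url, data=pdf_bytes, headers=headers, method="PUT"
    )


def extract_text(pdf_bytes, kind, url=TIKA_URL):
    """Send the PDF to the server and return the extracted text."""
    with urllib.request.urlopen(build_request(pdf_bytes, kind, url)) as resp:
        return resp.read().decode("utf-8")
```

Calling extract_text(open("file.pdf", "rb").read(), "native") would then request plain text with OCR switched off; file.pdf is a placeholder name.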

But your document is probably a hybrid: it contains native parts (where you only need to extract the text) and images (which you need to OCR). In my opinion there is no way to handle a hybrid PDF in Tika.