I’ve got Tika working with Tesseract on PDF files, but if I give it a PDF that has both searchable text and images, the text seems to be OCRed twice. Is there a way to avoid this? Even two passes would be fine: one for the straight text and another for just the images.
Tika uses two important flags when extracting text from a PDF: X-Tika-PDFextractInlineImages and X-Tika-PDFocrStrategy.
For a fully native PDF, the combination X-Tika-PDFextractInlineImages: false, X-Tika-PDFocrStrategy: NO_OCR seems to work best.
For a fully scanned PDF, you can use X-Tika-PDFextractInlineImages: true, X-Tika-PDFocrStrategy: OCR_ONLY.
Your document, though, is probably a hybrid: it contains native parts (where you only need to extract the text) and images (which you need to OCR). In my opinion, there is no way to handle a hybrid PDF in Tika.
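For the native and scanned cases, here is a rough sketch of how the two header combinations could be sent to a running Tika server from Python. It assumes a tika-server instance listening on localhost:9998 and the PUT /tika endpoint returning plain text; the helper names (tika_headers, extract_text) and the "native"/"scanned" labels are made up for illustration. It uses http.client because that module sends header names exactly as written.

```python
import http.client

# Header combinations from the answer: one for fully native PDFs,
# one for fully scanned PDFs. The dictionary keys are illustrative labels.
PDF_MODES = {
    "native": {
        "X-Tika-PDFextractInlineImages": "false",
        "X-Tika-PDFocrStrategy": "NO_OCR",
    },
    "scanned": {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_ONLY",
    },
}


def tika_headers(mode):
    """Build the request headers for the given kind of PDF."""
    if mode not in PDF_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    headers = {"Accept": "text/plain", "Content-Type": "application/pdf"}
    headers.update(PDF_MODES[mode])
    return headers


def extract_text(pdf_bytes, mode, host="localhost", port=9998):
    """Send the PDF to the Tika server and return the extracted text.

    host and port are assumptions; adjust them to your own setup.
    """
    conn = http.client.HTTPConnection(host, port, timeout=300)
    try:
        conn.request("PUT", "/tika", body=pdf_bytes, headers=tika_headers(mode))
        response = conn.getresponse()
        if response.status != 200:
            raise RuntimeError(f"Tika returned HTTP {response.status}")
        return response.read().decode("utf-8")
    finally:
        conn.close()
```

With a server running, you would call something like extract_text(open("scan.pdf", "rb").read(), "scanned"), where scan.pdf is just a placeholder file name. You still have to decide the mode per document yourself, so this does not solve the hybrid case.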