I have over 30,000 pdf files. Some files are already OCR and some are not. Is there a way to find out which files are already OCR'd and which pdfs are image only?
It will take for ever if I ran every single file through an OCR processor.
I would write a small script to extract the text from the PDF files and see if it is "empty". If there is text the PDF already was OCRed. You could either use ghostscript or XPDF to extract the text.
EDIT: This should get you started:
foreach ($pdffile in get-childitem -filter *.pdf){
$pdftext=invoke-expression ("\path\to\xpdf\pdftotext.exe '"+$pdffile.fullname+"' -");
write-host $pdffile.fullname
write-host $pdftext.length;
write-host $pdftext;
write-host "-------------------------------";
}
Unfortunately even when you have only images in your PDF pdftotext
will extract some text, so you will have to do some more work to check whether you need to OCR the pdf.