I use Ghostscript to strip the pages of a PDF file into JPG images and then run Tesseract on them to save the text content, like this:
Code:
$pathgs = "c:\\engine\\gs\\";
$pathtess = "c:\\engine\\tesseract\\";
$pathfile = "file/tmp/";
// Strip images
putenv("PATH=".$pathgs);
$exec = "gs -dNOPAUSE -sDEVICE=jpeg -r300 -sOutputFile=".$pathfile."strip%d.jpg ".$pathfile."upload.pdf -q -c quit";
shell_exec($exec);
// OCR
putenv("PATH=".$pathtess);
$exec = "tesseract.exe '".$pathfile."strip1.jpg' '".$pathfile."ocr' -l eng";
exec($exec, $msg);
print_r($msg);
echo file_get_contents($pathfile."ocr.txt");
Stripping the image (it's just 1 page) works fine, but Tesseract echoes:
Array
(
[0] => Tesseract Open Source OCR Engine v3.01 with Leptonica
[1] => Cannot open input file: 'file/tmp/strip1.jpg'
)
and no ocr.txt file is generated, which leads to a 'failed to open stream' error in PHP.
What am I doing wrong?
Perhaps the missing environment variables in PHP are the problem here. Have a look at my question here to see if setting HOME or PATH sorts this out.
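One way to try this is sketched below, as a starting point rather than a confirmed fix. It makes two assumptions. First, putenv("PATH=...") as used in the question replaces the whole PATH, so the sketch prepends the engine directory instead. Second, the quotes shown in the error message ('file/tmp/strip1.jpg') suggest the single quotes may have reached Tesseract literally, since cmd.exe on Windows only treats double quotes as quoting characters; escapeshellarg() picks the quoting style for the current platform. The helper names prepend_path and tesseract_command are hypothetical, not part of PHP or Tesseract.

```php
<?php
// Hypothetical helper: put $dir in front of the existing PATH instead of
// overwriting it, so the child process keeps the rest of its search path.
function prepend_path(string $dir): string
{
    $current = getenv('PATH');
    $newPath = ($current === false || $current === '')
        ? $dir
        : $dir . PATH_SEPARATOR . $current;
    putenv('PATH=' . $newPath);
    return $newPath;
}

// Hypothetical helper: build the Tesseract command line, letting
// escapeshellarg() choose the quoting for the current platform
// (double quotes on Windows, single quotes elsewhere).
function tesseract_command(string $image, string $outputBase, string $lang = 'eng'): string
{
    return 'tesseract ' . escapeshellarg($image)
        . ' ' . escapeshellarg($outputBase)
        . ' -l ' . escapeshellarg($lang);
}
```

With these in place, the OCR step from the question might become prepend_path($pathtess); followed by exec(tesseract_command($pathfile."strip1.jpg", $pathfile."ocr")." 2>&1", $msg, $status);. The 2>&1 redirect and the $status exit code make it easier to see why a run failed. The Ghostscript call is deliberately left alone here, because escapeshellarg() on Windows replaces percent signs with spaces and would break the %d in the output file name.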