I have implemented an API with Laminas and Mezzio (formerly Zend Expressive). It has a handler which uses the thiagoalessio\TesseractOCR library (https://github.com/thiagoalessio/tesseract-ocr-for-php) to call Tesseract from PHP.
In my development environment everything works fine. Extracting the text from an image via the API takes 2-6 seconds.
I first deployed the API to a Google Cloud VM and then to a Raspberry Pi 4 (4 GB RAM model). Both are very slow: a request takes 25-30 seconds. Tesseract itself doesn't seem to be the problem, since it is very fast when I call it from the CLI. Simple API calls aren't slow either. It seems that the combination of Laminas / Mezzio with Tesseract is what is slow. All I do is extract the text from the image and send it back as a JSON response.
I am running PHP 7.3 on an Apache 2 server. The Pi is connected to my local network via LAN. I am testing the API calls with Postman.
What can I do to increase performance? Is it the hardware?
This is my handler code:
<?php
declare(strict_types=1);

namespace App\Handler;

use Laminas\Diactoros\Response\JsonResponse;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\ServerRequestInterface;
use Psr\Http\Server\RequestHandlerInterface;
use thiagoalessio\TesseractOCR\TesseractOCR;

class OcrHandler implements RequestHandlerInterface
{
    public function handle(ServerRequestInterface $request) : ResponseInterface
    {
        $measure = [];
        $start = microtime(true);
        // Read the request body once and reuse it below.
        $contents = $request->getBody()->getContents();
        $result = '';
        if ($contents !== '') {
            $measure['body_parse'] = microtime(true) - $start;
            $start = microtime(true);
            // Write the uploaded image to a uniquely named file.
            $guid = $this->GUID();
            $imagePath = sprintf('%s/data/%s', getcwd(), $guid);
            file_put_contents($imagePath, $contents);
            $measure['image_write'] = microtime(true) - $start;
            $start = microtime(true);
            // Run Tesseract through the library, using the German language data.
            $tesseractOcr = new TesseractOCR($imagePath);
            $tesseractOcr->withoutTempFiles();
            $result = $tesseractOcr->lang('deu')->run();
            $measure['image_parsing'] = microtime(true) - $start;
            $start = microtime(true);
            // Remove the temporary image.
            unlink($imagePath);
            $measure['image_delete'] = microtime(true) - $start;
        }
        return new JsonResponse(['result' => $result, 'measure' => $measure]);
    }

    // Returns a GUID-like string used as a unique file name.
    private function GUID() : string
    {
        if (function_exists('com_create_guid') === true) {
            return trim(com_create_guid(), '{}');
        }
        return sprintf('%04X%04X-%04X-%04X-%04X-%04X%04X%04X', mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(16384, 20479), mt_rand(32768, 49151), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535));
    }
}
Edit
OK, so I've added time measurements and found the bottleneck. It is indeed "image_parsing", the execution of Tesseract. That seems strange to me because, as I said, it's very fast on the CLI. Here it accounts for almost all of the response time (27.9 seconds):
{
"result": "...",
"measure": {
"body_parse": 0.0018658638000488281,
"image_write": 0.0020492076873779297,
"image_parsing": 27.909277200698853,
"image_delete": 0.0005030632019042969
}
}
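The per-step timing used above could be factored into a small helper so the `microtime(true)` bookkeeping is not repeated for every step. This is only a sketch: `timed()` is a hypothetical helper name, and the `usleep()` calls are stand-ins for the real OCR and cleanup steps.

```php
<?php
declare(strict_types=1);

/**
 * Runs $step, stores its wall-clock duration in seconds under
 * $measure[$label], and returns whatever $step returned.
 * (Hypothetical helper, not part of Mezzio or the OCR library.)
 */
function timed(array &$measure, string $label, callable $step)
{
    $start = microtime(true);
    $result = $step();
    $measure[$label] = microtime(true) - $start;

    return $result;
}

// Usage with stand-in steps; replace the closures with the real work.
$measure = [];
$text = timed($measure, 'image_parsing', function (): string {
    usleep(20000); // stand-in for the OCR call
    return 'recognized text';
});
timed($measure, 'image_delete', function (): void {
    usleep(1000); // stand-in for unlink()
});

echo json_encode(['result' => $text, 'measure' => $measure], JSON_PRETTY_PRINT), PHP_EOL;
```

The resulting `$measure` array has the same shape as the "measure" object in the JSON response above, so it can be passed to the response unchanged.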
Why is it so fast on the CLI but so slow when I call it from PHP? Is there any possible performance improvement?
OK, so as I already mentioned in my edit, the bottleneck seems to be the image parsing. To be more specific, the bottleneck is the library "thiagoalessio/tesseract-ocr-for-php". The following code, which uses PHP's exec function instead of the library, takes 5.82 seconds (compared to 27.9 seconds). That's a huge difference. It works fine, assuming you have Tesseract installed on your machine:
<?php
declare(strict_types=1);

namespace App\Handler;

use Laminas\Diactoros\Response\JsonResponse;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\ServerRequestInterface;
use Psr\Http\Server\RequestHandlerInterface;

class OcrHandler implements RequestHandlerInterface
{
    public function handle(ServerRequestInterface $request) : ResponseInterface
    {
        $measure = [];
        $start = microtime(true);
        // Read the request body once and reuse it below.
        $contents = $request->getBody()->getContents();
        $result = '';
        if ($contents !== '') {
            $measure['body_parse'] = microtime(true) - $start;
            $start = microtime(true);
            // Write the uploaded image to a uniquely named file.
            $guid = $this->GUID();
            $imagePath = sprintf('%s/data/%s', getcwd(), $guid);
            $outputPath = $imagePath . '_out';
            file_put_contents($imagePath, $contents);
            $measure['image_write'] = microtime(true) - $start;
            $start = microtime(true);
            // Call the tesseract binary directly; it writes its result to "<outputPath>.txt".
            // No -l option is passed, so Tesseract uses its default language here.
            exec(sprintf('tesseract %s %s', escapeshellarg($imagePath), escapeshellarg($outputPath)));
            $result = file_get_contents($outputPath . '.txt');
            $measure['image_parsing'] = microtime(true) - $start;
            $start = microtime(true);
            // Remove the temporary image and the Tesseract output file.
            unlink($imagePath);
            unlink($outputPath . '.txt');
            $measure['image_delete'] = microtime(true) - $start;
        }
        return new JsonResponse(['result' => $result, 'measure' => $measure]);
    }

    // Returns a GUID-like string used as a unique file name.
    private function GUID() : string
    {
        if (function_exists('com_create_guid') === true) {
            return trim(com_create_guid(), '{}');
        }
        return sprintf('%04X%04X-%04X-%04X-%04X-%04X%04X%04X', mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(16384, 20479), mt_rand(32768, 49151), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535));
    }
}
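One caveat about the comparison: the exec call in the handler passes no language, while the library version uses `lang('deu')`. A like-for-like variant might look like the sketch below, which passes `-l deu` and lets Tesseract print to stdout so no output file is needed. The function names `buildTesseractCommand()` and `runTesseract()` are hypothetical, and I have not benchmarked this variant, so the timings above do not apply to it.

```php
<?php
declare(strict_types=1);

/**
 * Builds a Tesseract command line that prints the recognized text to stdout.
 * "stdout" as the output base and "-l <lang>" are standard Tesseract CLI usage.
 */
function buildTesseractCommand(string $imagePath, string $lang = 'deu'): string
{
    return sprintf(
        'tesseract %s stdout -l %s',
        escapeshellarg($imagePath),
        escapeshellarg($lang)
    );
}

/**
 * Runs Tesseract and returns the recognized text.
 * Throws if the process exits with a non-zero status.
 */
function runTesseract(string $imagePath, string $lang = 'deu'): string
{
    $lines = [];
    $exitCode = 0;
    exec(buildTesseractCommand($imagePath, $lang), $lines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException(sprintf('tesseract exited with status %d', $exitCode));
    }

    // exec() returns the output line by line; join it back together.
    return implode("\n", $lines);
}

// Only print the command here, so the sketch runs without Tesseract installed.
echo buildTesseractCommand('/tmp/example image.png'), PHP_EOL;
```

In the handler, `runTesseract($imagePath)` would replace the exec / file_get_contents pair, and the second unlink would no longer be needed.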
You will find a lot of recommendations for the thiagoalessio/tesseract-ocr-for-php library on Stack Overflow, but you should check its performance in your own environment. On my dev machine it worked fine, but in production it was very slow, and production is a question of cost.
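To check this on your own hardware, a small benchmark that runs each approach several times is more reliable than a single request. The following is a minimal sketch; `benchmark()` is a hypothetical helper, and the `usleep()` closure is a stand-in that you would replace with the library call or the exec call.

```php
<?php
declare(strict_types=1);

/**
 * Calls $fn $runs times and returns the min, median and max
 * wall-clock duration in seconds. For an even number of runs
 * the upper of the two middle values is reported as the median.
 */
function benchmark(callable $fn, int $runs = 5): array
{
    if ($runs < 1) {
        throw new InvalidArgumentException('$runs must be at least 1');
    }

    $times = [];
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $fn();
        $times[] = microtime(true) - $start;
    }
    sort($times);

    return [
        'min' => $times[0],
        'median' => $times[intdiv(count($times), 2)],
        'max' => $times[count($times) - 1],
    ];
}

// Stand-in workload; swap in the OCR call you want to measure.
$stats = benchmark(function (): void {
    usleep(5000);
}, 3);

printf(
    "min %.4fs, median %.4fs, max %.4fs\n",
    $stats['min'],
    $stats['median'],
    $stats['max']
);
```

Running it once from the CLI and once through Apache on the same machine should show whether the slowdown comes from the library or from the environment PHP runs in.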