
Tesseract with PHP performance


I have implemented an API with Laminas and Mezzio (formerly Zend Expressive). It has a handler that uses the thiagoalessio\TesseractOCR library (https://github.com/thiagoalessio/tesseract-ocr-for-php) to call Tesseract from PHP.

In my development environment everything works fine. Extracting the text from an image by calling the API takes 2-6 seconds.

I first deployed the API to a Google Cloud VM and then to a Raspberry Pi 4 (4 GB RAM model). Both are very slow: a request takes 25-30 seconds to respond. Tesseract itself doesn't seem to be the problem; if I call it from the CLI, it is very fast. Other, simple API calls aren't slow either. It seems that the combination of Laminas / Mezzio with Tesseract is what is slow. All I do is extract the text from the image and send it back as a JSON response.

I am running PHP 7.3 on an Apache2 server. The Pi is connected to my local network via LAN. I am testing the API calls with Postman.

What can I do to increase performance? Is it the hardware?

This is my handler code:

<?php

declare(strict_types=1);

namespace App\Handler;

use Laminas\Diactoros\Response\JsonResponse;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\ServerRequestInterface;
use Psr\Http\Server\RequestHandlerInterface;
use thiagoalessio\TesseractOCR\TesseractOCR;

class OcrHandler implements RequestHandlerInterface
{
    public function handle(ServerRequestInterface $request) : ResponseInterface
    {
        $measure = [];
        $start = microtime(true);

        $body = $request->getBody();

        $result = '';

        if(!empty($body->getContents())) {
            $measure['body_parse'] = microtime(true) - $start;
            $start = microtime(true);    

            $guid = $this->GUID();
            $imagePath = sprintf('%s/data/%s', getcwd(), $guid);

            file_put_contents($imagePath, $body->getContents());
            
            $measure['image_write'] = microtime(true) - $start;
            $start = microtime(true);

            $tesseractOcr = new TesseractOCR($imagePath);
            $tesseractOcr->withoutTempFiles();
            $result = $tesseractOcr->lang('deu')->run();
            
            $measure['image_parsing'] = microtime(true) - $start;
            $start = microtime(true);

            unlink($imagePath);

            $measure['image_delete'] = microtime(true) - $start;
        }

        return new JsonResponse(['result' => $result, 'measure' => $measure]);
    }

    private function GUID()
    {
        if (function_exists('com_create_guid') === true)
            return trim(com_create_guid(), '{}');
    
        return sprintf('%04X%04X-%04X-%04X-%04X-%04X%04X%04X', mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(16384, 20479), mt_rand(32768, 49151), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535));
    }
}

Edit

OK, so I've added time measurements and found the bottleneck. It is indeed "image_parsing", the execution of Tesseract. That is strange to me because, as I said, it is very fast on the CLI. Here it accounts for almost all of the response time (27.9 seconds)!

{
    "result": "...",
    "measure": {
        "body_parse": 0.0018658638000488281,
        "image_write": 0.0020492076873779297,
        "image_parsing": 27.909277200698853,
        "image_delete": 0.0005030632019042969
    }
}

Why is it so fast on the CLI but so slow when I call it from PHP? Is there any possible performance improvement?
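
The per-stage timing used in the handler can be factored into a small helper so that each stage is measured the same way. This is only a sketch: the function name timed() is hypothetical and not part of the handler or of any library, and the usleep() call merely stands in for the OCR step.

```php
<?php

declare(strict_types=1);

/**
 * Runs $stage, stores its wall-clock duration (in seconds) under $label,
 * and returns whatever the stage returned.
 * Hypothetical helper, not part of the original handler.
 */
function timed(string $label, callable $stage, array &$measure)
{
    $start = microtime(true);
    $value = $stage();
    $measure[$label] = microtime(true) - $start;

    return $value;
}

$measure = [];

// Stand-in for the OCR call: sleeps for roughly 50 ms, then returns a string.
$result = timed('image_parsing', function (): string {
    usleep(50000);

    return 'recognized text';
}, $measure);

// Same response shape as the handler's JSON output.
echo json_encode(['result' => $result, 'measure' => $measure]), PHP_EOL;
```

In the handler, each block between two microtime() calls would become one timed() call, which keeps the measurement code out of the actual logic.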


Solution

  • As mentioned in my edit, the bottleneck is the image parsing. More specifically, it is the "thiagoalessio/tesseract-ocr-for-php" library. The following code, which calls Tesseract through PHP's exec function instead of the library, takes 5.82 seconds (compared to 27.9 seconds). That's a huge difference. The code works fine, assuming you've got tesseract installed on your machine:

    <?php
    
    declare(strict_types=1);
    
    namespace App\Handler;
    
    use Laminas\Diactoros\Response\JsonResponse;
    use Psr\Http\Message\ResponseInterface;
    use Psr\Http\Message\ServerRequestInterface;
    use Psr\Http\Server\RequestHandlerInterface;
    
    class OcrHandler implements RequestHandlerInterface
    {
        public function handle(ServerRequestInterface $request) : ResponseInterface
        {
            $measure = [];
            $start = microtime(true);
    
            $body = $request->getBody();
    
            $result = '';
    
            if(!empty($body->getContents())) {
                $measure['body_parse'] = microtime(true) - $start;
                $start = microtime(true);    
    
                $guid = $this->GUID();
                $imagePath = sprintf('%s/data/%s', getcwd(), $guid);
                $outputPath = $imagePath . '_out';
    
                file_put_contents($imagePath, $body->getContents());
                
                $measure['image_write'] = microtime(true) - $start;
                $start = microtime(true);
    
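                // Quote both paths so a working directory containing spaces cannot break the command.
                exec(sprintf('tesseract %s %s', escapeshellarg($imagePath), escapeshellarg($outputPath)));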
                $result = file_get_contents($outputPath . '.txt');
                
                $measure['image_parsing'] = microtime(true) - $start;
                $start = microtime(true);
    
                unlink($imagePath);
                unlink($outputPath . '.txt');
    
                $measure['image_delete'] = microtime(true) - $start;
            }
    
            return new JsonResponse(['result' => $result, 'measure' => $measure]);
        }
    
        private function GUID()
        {
            if (function_exists('com_create_guid') === true)
                return trim(com_create_guid(), '{}');
        
            return sprintf('%04X%04X-%04X-%04X-%04X-%04X%04X%04X', mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(16384, 20479), mt_rand(32768, 49151), mt_rand(0, 65535), mt_rand(0, 65535), mt_rand(0, 65535));
        }
    }
    

    You will find many recommendations for the thiagoalessio/tesseract-ocr-for-php library on Stack Overflow, but you should measure its performance in your own environment. On my dev machine it worked fine, but in production it was very slow, and in production, performance is a question of cost.
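
The exec call above does not pass a language, whereas the original handler used lang('deu'). A possible variant that passes the language with Tesseract's -l option, quotes every argument, and reads the text from standard output instead of a temporary .txt file could look like the sketch below. The function names buildTesseractCommand() and runTesseract() are hypothetical, and the sketch assumes a Unix-like shell, tesseract on the PATH, and installed "deu" language data. It has not been timed, so its performance would need to be measured as well.

```php
<?php

declare(strict_types=1);

/**
 * Builds a shell command that runs OCR on $imagePath and prints the text to stdout.
 * Hypothetical helper; $lang defaults to 'deu' to match lang('deu') in the question.
 */
function buildTesseractCommand(string $imagePath, string $lang = 'deu'): string
{
    // "stdout" as the output base makes tesseract print the text
    // instead of writing <outputbase>.txt.
    return sprintf(
        'tesseract %s stdout -l %s',
        escapeshellarg($imagePath),
        escapeshellarg($lang)
    );
}

/**
 * Runs the command and returns the recognized text.
 * Throws if tesseract exits with a non-zero status.
 */
function runTesseract(string $imagePath, string $lang = 'deu'): string
{
    // Discard tesseract's diagnostics on stderr (assumes a Unix-like shell).
    exec(buildTesseractCommand($imagePath, $lang) . ' 2>/dev/null', $lines, $exitCode);

    if ($exitCode !== 0) {
        throw new RuntimeException(sprintf('tesseract exited with status %d', $exitCode));
    }

    return implode("\n", $lines);
}

// Show the command that would be executed for a path containing a space.
echo buildTesseractCommand('/tmp/my scan.png'), PHP_EOL;
```

In the handler, $result = runTesseract($imagePath); would then replace the exec and file_get_contents lines, and the second unlink would no longer be needed because no output file is written.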