
Converting PDF to PNG for Tesseract to process


I'm having an issue with ImageMagick and Tesseract.

I'm working on a command-line document classifier in PHP. The idea is that it takes in PDF documents and uses the League Pipeline package to pass each one through a series of steps. The steps I've identified as necessary are as follows:

  1. Convert PDF to a PNG file
  2. Extract text from PNG file
  3. Run text through a machine learning library to classify it

The main command for that looks like this:

<?php

namespace Matthewbdaly\LetterClassifier\Commands;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Input\InputArgument;
use League\Pipeline\Pipeline;
use Matthewbdaly\LetterClassifier\Stages\ConvertPdfToPng;
use Matthewbdaly\LetterClassifier\Stages\ReadFile;

class Processor extends Command
{
    protected function configure()
    {
        $this->setName('process')
            ->setDescription('Processes a file')
            ->setHelp('This command processes a file')
            ->addArgument('file', InputArgument::REQUIRED, 'File to process');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $file = $input->getArgument('file');
        $pipeline = (new Pipeline)
            ->pipe(new ConvertPdfToPng)
            ->pipe(new ReadFile);
        $pipeline->process($file);
    }
}

As you can see, it accepts a filename as the first argument, then defines a pipeline for the required steps, before passing the file to the pipeline.
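
To make the data flow explicit, here is a minimal plain-PHP sketch of how stages chain: each stage is a callable that receives whatever the previous stage returned. This is not League Pipeline's own implementation; `runStages` and the two stand-in stages are hypothetical names used only for illustration.

```php
<?php

// Plain-PHP sketch of stage chaining; this is NOT League\Pipeline's code.
// Each stage is a callable that receives the previous stage's return value.
function runStages(array $stages, $payload)
{
    return array_reduce(
        $stages,
        function ($carry, callable $stage) {
            return $stage($carry);
        },
        $payload
    );
}

// Hypothetical stand-in stages, used only to show the data flow.
$toUpper = function (string $text): string {
    return strtoupper($text);
};
$addSuffix = function (string $text): string {
    return $text . '.png';
};

// The first stage receives 'quote'; the second receives the first stage's result.
$result = runStages([$toUpper, $addSuffix], 'quote');
```

In the command above, the same idea means that whatever ConvertPdfToPng returns is what ReadFile receives as its argument.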

The step for converting the PDF looks like this:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use Imagick;

class ConvertPdfToPng
{
    public function __invoke($file)
    {
        $tmp = tmpfile();
        $uri = stream_get_meta_data($tmp)['uri'];
        $img = new Imagick($file);
        $img->setResolution(300, 300);
        $img->setImageDepth(8);
        $img->setImageFormat('png');
        $img->writeImage($uri);
        return $tmp;
    }
}

It writes a PNG version of the PDF as a temporary file. The generated file looks OK, at least to my eye, but it can't be read correctly by Tesseract. Here's the second step where Tesseract should process the file:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use thiagoalessio\TesseractOCR\TesseractOCR;

class ReadFile
{

    public function __invoke($file)
    {
        $uri = stream_get_meta_data($file)['uri'];
        $ocr = new TesseractOCR($uri);
        $output = $ocr->lang('eng')->run();
        eval(\Psy\Sh());
    }
}

The output from Psysh looks like this:

=> """
   Am sum\n
   \n
   mm“ m mun SuHrkw-l\n
   n m 51mm\n
   \n
   mm\n
   \n
   um um\n
   \n
   ms Murine\n
   1 Elm: 51mm\n
   Emnuumn\n
   \n
   a mu\n
   \n
   m Mm 2m-\n
   Dav st-n-m.\n
   \n
   P‘Eualanfl ma lumnflarvlmamrmy ”Hay ”mum-m-\n
   we we “mum-m n: "mum,“ m mun\n
   \n
   vm [harem\n
   \n
   Am smrm
   """

This is not the content of the letter I'm trying to classify - the text is getting mangled. If I run the following commands from the shell, they work as expected to convert and write the letter's text to the output file:

convert -density 300 Quote.pdf output.png
tesseract output.png output

If I hardcode the path in the Tesseract stage to point at the output.png generated by the convert command, that works too. So I'm fairly confident the issue is in the step that generates the PNG file. I'm not that experienced with ImageMagick, so I'm unsure why the file can't be processed, but it seems like there's a setting of some kind that I'm missing.

Can anyone suggest what the problem might be?


Solution

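  • I suspect the problem is that Imagick reads the PDF before you call setResolution(). A PDF is rasterised at the moment it is read, so passing the filename to the constructor renders it at the default density (72 DPI), and changing the resolution afterwards does not re-render the page. That matches your shell command, where -density 300 comes before the input file.

    Try instantiating an empty Imagick object, setting the resolution and then reading the file: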

    $img = new Imagick();
    $img->setResolution(300, 300);
    $img->readImage($file);
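
Put together, the conversion stage from the question might look like the sketch below. It keeps the original calls and only changes the order so the resolution is set before the PDF is read. The namespace is left out so the snippet stands alone, and it assumes the imagick extension is installed and that the input is a single-page PDF.

```php
<?php

// Sketch of the ConvertPdfToPng stage with the resolution set before reading.
// Assumptions: imagick extension available, single-page PDF, namespace omitted.
class ConvertPdfToPng
{
    public function __invoke($file)
    {
        $tmp = tmpfile();
        $uri = stream_get_meta_data($tmp)['uri'];

        // Start from an empty object so the density applies when the PDF is rasterised.
        $img = new \Imagick();
        $img->setResolution(300, 300);
        $img->readImage($file);

        $img->setImageDepth(8);
        $img->setImageFormat('png');
        $img->writeImage($uri);
        $img->clear();

        // Return the handle rather than the path: the temporary file is
        // deleted as soon as the handle is closed.
        return $tmp;
    }
}
```

One way to confirm the change took effect is to compare the pixel dimensions of the generated PNG before and after: a page rendered at 300 DPI should be roughly four times wider and taller than the same page rendered at the 72 DPI default.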