I have 9k PDFs of scanned text I need to clean up/run OCR on. The pages of each PDF are images saved as .ccitt, which I extract and convert to .png using this Poppler (for Windows 7) command:
pdfimages.exe -png file_in.pdf output/images/path
After cleaning up the .png images I recombine them into a PDF using this ImageMagick command:
magick.exe convert -compress Group4 -type bilevel -monochrome input/images/path file_out.pdf
The resulting file_out.pdf is actually smaller than file_in.pdf, but it takes up to 25 seconds to Group4-compress just 18 images (ranging in size from 58 KB to 140 KB). It would take 65 hours to convert all the images into 9k+ PDFs this way :'(
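The 65-hour figure checks out as a rough estimate, assuming ~25 seconds per 18-image PDF across all 9000+ files:

```python
# Back-of-the-envelope check: ~25 seconds per PDF, 9000 PDFs.
seconds_per_pdf = 25
pdf_count = 9000
total_hours = seconds_per_pdf * pdf_count / 3600
print(total_hours)  # → 62.5, close to the ~65-hour estimate
```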
The same thing via GraphicsMagick:
gm convert -compress Group4 -type bilevel -monochrome input/images/path file_out.pdf
inflates file_out.pdf to over 40x the size of file_in.pdf.
What am I missing? I thought GraphicsMagick was supposed to be leaner/meaner than ImageMagick.
ImageMagick is not a good processor for vector formats such as PDF. It will rasterize your PDF and save each dot as an element of the output PDF, which may be why it takes so long. The result is a raster image (much larger than the original vector content) wrapped in a vector shell.
If your input PDF is already black and white, then you only need the Group4 compression.
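One way to confirm the pages really are bilevel CCITT images is Poppler's `pdfimages -list`, which the question already has installed; a small wrapper sketch, with the file name illustrative:

```python
# Sketch: inspect a PDF's embedded images without extracting them,
# assuming Poppler's pdfimages is on PATH (as in the question).
import subprocess

def list_images_command(pdf_path: str) -> list:
    # `pdfimages -list` prints one row per embedded image, including its
    # encoding (e.g. "ccitt") and bit depth; bilevel CCITT pages need
    # nothing beyond -compress Group4 on the way back out.
    return ["pdfimages", "-list", pdf_path]

# To actually run it:
# subprocess.run(list_images_command("file_in.pdf"), check=True)
```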
Starting with a 25 KB PDF, if I just convert it:
time magick ImageOnly.pdf result1.pdf
real 0m0.276s
user 0m0.563s
sys 0m0.038s
time magick ImageOnly.pdf -compress Group4 result2.pdf
real 0m0.275s
user 0m0.562s
sys 0m0.036s
So it is not the Group4 compression that is slowing it down.
However, the quality will not be terrific, so one should add -density 300 before reading the PDF. But that will slow it down:
time magick -density 300 ImageOnly.pdf -compress Group4 result3.pdf
real 0m2.026s
user 0m2.863s
sys 0m0.182s
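At ~2 seconds per PDF that is still around 5 hours for 9000 files, but each PDF is independent, so conversions can run side by side. A driver sketch, assuming ImageMagick's `magick` is on PATH; the directory layout and worker count are illustrative:

```python
# Sketch: run several Group4 conversions in parallel.
# Assumes ImageMagick's `magick` is on PATH; paths are illustrative.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def group4_command(in_pdf: Path, out_pdf: Path) -> list:
    # Same invocation as the timing above: read at 300 dpi,
    # then write bilevel Group4.
    return ["magick", "-density", "300", str(in_pdf),
            "-compress", "Group4", str(out_pdf)]

def convert_all(in_dir: Path, out_dir: Path, workers: int = 4) -> None:
    # ~2 s per PDF serially (~5 h for 9000 files); 4 workers
    # cuts that roughly fourfold, CPU permitting.
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pdf in sorted(in_dir.glob("*.pdf")):
            pool.submit(subprocess.run,
                        group4_command(pdf, out_dir / pdf.name),
                        check=True)
```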