I have 9k PDFs of scanned text I need to clean up/run OCR on. The pages of each PDF are images saved as .ccitt, which I extract and convert to .png using this Poppler (for Windows 7) command:
pdfimages.exe -png file_in.pdf output/images/path
After cleaning up the .png images I recombine them into a PDF using this ImageMagick command:
magick.exe convert -compress Group4 -type bilevel -monochrome input/images/path file_out.pdf
The resulting file_out.pdf is actually smaller than file_in.pdf, but it takes up to 25 seconds to Group4-compress just 18 images (ranging in size from 58 KB to 140 KB). It would take 65 hours to convert all the images into 9k+ PDFs this way :'(
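The 65-hour figure checks out as a rough estimate, assuming ~25 seconds per 18-image PDF across all 9000+ files:

```python
# Back-of-the-envelope check: ~25 seconds per PDF, 9000 PDFs.
seconds_per_pdf = 25
pdf_count = 9000
total_hours = seconds_per_pdf * pdf_count / 3600
print(total_hours)  # → 62.5, close to the ~65-hour estimate
```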
The same thing via GraphicsMagick:
gm convert -compress Group4 -type bilevel -monochrome input/images/path file_out.pdf
inflates file_out.pdf to over 40x the size of file_in.pdf.
What am I missing? I thought GraphicsMagick was supposed to be leaner/meaner than ImageMagick.
ImageMagick is not a good processor for vector formats such as PDF. It will rasterize your PDF and save each dot as an element of the output PDF, which may be why it takes so long. The result is a raster image (much larger than the original vector content) wrapped in a vector shell.
If your input PDF is already black and white, then you only need the Group4 compression.
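One way to confirm the pages really are bilevel CCITT images is Poppler's `pdfimages -list`, which the question already has installed; a small wrapper sketch, with the file name illustrative:

```python
# Sketch: inspect a PDF's embedded images without extracting them,
# assuming Poppler's pdfimages is on PATH (as in the question).
import subprocess

def list_images_command(pdf_path: str) -> list:
    # `pdfimages -list` prints one row per embedded image, including its
    # encoding (e.g. "ccitt") and bit depth; bilevel CCITT pages need
    # nothing beyond -compress Group4 on the way back out.
    return ["pdfimages", "-list", pdf_path]

# To actually run it:
# subprocess.run(list_images_command("file_in.pdf"), check=True)
```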
Starting with a 25 KB PDF, if I just convert it:
time magick ImageOnly.pdf result1.pdf
real 0m0.276s
user 0m0.563s
sys 0m0.038s
time magick ImageOnly.pdf -compress Group4 result2.pdf
real 0m0.275s
user 0m0.562s
sys 0m0.036s
So it is not the Group4 compression that is slowing it down.
However, the quality will not be terrific, so one should add -density 300 before reading the PDF. But that will slow it down:
time magick -density 300 ImageOnly.pdf -compress Group4 result3.pdf
real 0m2.026s
user 0m2.863s
sys 0m0.182s
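At ~2 seconds per PDF that is still around 5 hours for 9000 files, but each PDF is independent, so conversions can run side by side. A driver sketch, assuming ImageMagick's `magick` is on PATH; the directory layout and worker count are illustrative:

```python
# Sketch: run several Group4 conversions in parallel.
# Assumes ImageMagick's `magick` is on PATH; paths are illustrative.
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def group4_command(in_pdf: Path, out_pdf: Path) -> list:
    # Same invocation as the timing above: read at 300 dpi,
    # then write bilevel Group4.
    return ["magick", "-density", "300", str(in_pdf),
            "-compress", "Group4", str(out_pdf)]

def convert_all(in_dir: Path, out_dir: Path, workers: int = 4) -> None:
    # ~2 s per PDF serially (~5 h for 9000 files); 4 workers
    # cuts that roughly fourfold, CPU permitting.
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pdf in sorted(in_dir.glob("*.pdf")):
            pool.submit(subprocess.run,
                        group4_command(pdf, out_dir / pdf.name),
                        check=True)
```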