Tags: image-processing, imagemagick, poppler, pdftoppm

When extracting pages from a PDF, ImageMagick improves images but blurs text; pdftoppm keeps text crisp but leaves images dark


I want to extract all pages from this PDF file, improve their color levels, and eventually OCR them.

I've used ImageMagick:

magick Historia_de_CA_vol1_Cap1_0.pdf mogrify -auto-level Historia_de_CA_vol1_Cap1_0-*.jpg,

which remarkably improves the quality of embedded images, as can be seen in the document's 1st and 21st pages. I suspect this is because ImageMagick properly interprets a transparency layer that Adobe Acrobat Reader converts to a black or dark background. Unfortunately, the extracted text is blurrier than in the original.

I've also used Poppler's pdftoppm utility:

pdftoppm -jpeg Historia_de_CA_vol1_Cap1_0.pdf Historia_de_CA_vol1_Cap1_0,

which produces crisp text, suitable for OCR, but retains the poor quality of the embedded images seen on pages 1 and 21 of the original PDF, where transparency seems to be rendered as a dark layer.

How can I get ImageMagick to produce both improved images and crisp text suitable for OCR, or conversely, how can I get pdftoppm to properly render the suspected transparent layer in the original PDF?


Solution

  • Your ImageMagick command may be flawed. With magick mogrify, do not put the input file between magick and mogrify. The general form of magick mogrify is

    magick mogrify -path path_to_output -format format_for_output * (or *.suffix)
    

    This reads all images in the current directory and writes them with the same name to the desired directory and with the desired suffix.

    Perhaps you want just magick, not magick mogrify:

    magick Historia_de_CA_vol1_Cap1_0.pdf -auto-level Historia_de_CA_vol1_Cap1_0.jpg
    

    That will create outputs named Historia_de_CA_vol1_Cap1_0-N.jpg, where N runs from 0 to one less than the number of pages.

    ADDITION

    To increase text sharpness, render the PDF at a higher density (set -density before the input file) and then resize by the inverse factor.

    magick -density 288 Historia_de_CA_vol1_Cap1_0.pdf -resize 25% -auto-level Historia_de_CA_vol1_Cap1_0.jpg
    

    (Note: a density of 288 = 72 x 4, so resize by 1/4 = 25%.)
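
The density and resize arithmetic in the note can be generalized to other densities. The following is only a sketch: resize_percent is a hypothetical helper (not part of ImageMagick), and it assumes the default rendering density of 72 that the note relies on. The magick call is the same one given in the answer and is run only if ImageMagick 7 and the PDF are actually present.

```shell
#!/bin/sh
# Hypothetical helper (not part of ImageMagick): given a rendering density,
# print the -resize percentage that scales the page back to its size at the
# default density of 72 (assumption taken from the note: 288 = 72 x 4).
resize_percent() {
    awk -v d="$1" 'BEGIN { printf "%g", 72 * 100 / d }'
}

pdf="Historia_de_CA_vol1_Cap1_0.pdf"   # file name taken from the question
density=288                             # 72 x 4, as in the answer
percent=$(resize_percent "$density")

# Show the command that would be run.
echo "magick -density $density $pdf -resize ${percent}% -auto-level ${pdf%.pdf}.jpg"

# Run it only when the magick binary and the PDF are actually available.
if command -v magick >/dev/null 2>&1 && [ -f "$pdf" ]; then
    magick -density "$density" "$pdf" -resize "${percent}%" -auto-level "${pdf%.pdf}.jpg"
fi
```

With density=144 the helper yields 50, so the same pattern works for a 2x render. Whether a higher factor actually helps OCR on this particular PDF is something to check by eye on a few pages.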