
Problems using ImageMagick for converting PDF with accented characters


I am having a problem converting PDFs to images using ImageMagick or Ghostscript: all accented characters disappear from the converted image. I found a couple of people with the same problem, and apparently updating the ImageMagick and Ghostscript packages fixed it for them, but not for me.

I used this PDF file for every test I ran: https://www.dropbox.com/s/3gso0sw1e1n8f9r/error-with-accents.pdf?dl=0

I have an Ubuntu 14.04.2 LTS server on Azure where I need ImageMagick to work. The official repositories provide ImageMagick 6.7.7 and Ghostscript 9.10. To try to fix my issue, I upgraded both: I now also have ImageMagick 6.8.9-10 installed under /opt/imagemagick-6.8, and I added the Ubuntu 15.04 repository so I could install Ghostscript 9.15 directly through apt-get. Neither upgrade fixed the problem for me.

Here are my latest attempts on the Ubuntu 14.04 server:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.2 LTS
Release:    14.04
Codename:   trusty

$ /opt/imagemagick-6.8/bin/convert -version
Version: ImageMagick 6.8.9-10 Q16 x86_64 2015-07-30 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2014 ImageMagick Studio LLC
Features: DPC OpenMP
Delegates: jng jpeg png x xml zlib

$ /opt/imagemagick-6.8/bin/convert -list configure |grep DELEGATES
DELEGATES      mpeg jng jpeg png ps x xml zlib

$ /opt/imagemagick-6.8/bin/convert error-with-accents.pdf -verbose -alpha off -resample 150 -density 150 -quality '80' im-test.jpg
   **** Warning: considering '0000000000 XXXXX n' as a free entry.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Mac OS X 10.10.4 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

error-with-accents.pdf=>im-test.jpg PDF 595x794=>1240x1654 1240x1654+0+0 16-bit sRGB 172KB 0.440u 0:00.240

$ gs -v
GPL Ghostscript 9.15 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc.  All rights reserved.

$ gs -dBATCH -dNOPAUSE -sDEVICE=jpeg -sOutputFile=gs-test.jpg error-with-accents.pdf 
GPL Ghostscript 9.15 (2014-09-22)
Copyright (C) 2014 Artifex Software, Inc.  All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
   **** Warning: considering '0000000000 XXXXX n' as a free entry.
Processing pages 1 through 1.
Page 1

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Mac OS X 10.10.4 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

$ convert -version
Version: ImageMagick 6.7.7-10 2014-03-06 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP    

$ convert -list configure |grep DELEGATES
DELEGATES     bzlib djvu fftw fontconfig freetype jbig jpeg jng jp2 lcms2 lqr lzma openexr pango png rsvg tiff x11 xml wmf zlib

$ convert error-with-accents.pdf -verbose -alpha off -resample 150 -density 150 -quality '80' im-test-6.7.7.jpg
   **** Warning: considering '0000000000 XXXXX n' as a free entry.

   **** This file had errors that were repaired or ignored.
   **** The file was produced by: 
   **** >>>> Mac OS X 10.10.4 Quartz PDFContext <<<<
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

error-with-accents.pdf=>im-test-6.7.7.jpg PDF 595x794=>1240x1654 1240x1654+0+0 16-bit DirectClass 160KB 0.490u 0:00.279

All three give the same result:

gs-test.jpg

im-test.jpg

im-test-6.7.7.jpg

I am able to run Ghostscript and ImageMagick correctly on Mac OS, and according to this post, the versions I have on Ubuntu should work. So I suspect it is something related to FreeType fonts, which I do not know how to fix. Any help?


Solution

  • The PDF document you are trying to process has been modified and re-saved very often: 455 times between 2010-03-06 and 2014-06-17.

    You can verify that by running pdfinfo -meta error-with-accents.pdf.

    I do not speak or read Portuguese, so I cannot immediately tell whether an accent is missing from an output image where one should be.

    When I tried your command with IM v6.9.0-0 Q16 x86_64 2015-05-14 (using Ghostscript v9.16), I did not see any error.

    All the fonts your PDF uses are embedded (see the emb column below). This means that FreeType will not be employed to look for any replacement or substitute font:

    $ pdffonts error-with-accents.pdf 
    
      name                       type       encoding         emb sub uni object ID
      -------------------------- ---------- ---------------- --- --- --- ---------
      RUXYWW+ConduitITC-Light    Type 1C    MacRoman         yes yes no      14  0
      NOYZMG+Y2KNeophyte         TrueType   WinAnsi          yes yes yes     10  0
      MVLYKX+ConduitITC-Medium   Type 1C    MacRoman         yes yes no      15  0
      JDNVDM+ConduitITC-Bold     Type 1C    MacRoman         yes yes no      13  0
    

    In any case, you should concentrate on getting a version of Ghostscript that processes your PDF correctly, because ImageMagick does not do any PDF processing on its own: it relies on Ghostscript as its "delegate" to do so.
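
The two checks described in this answer (are all fonts embedded, and does Ghostscript on its own render the page correctly) can be scripted roughly as follows. This is only a minimal sketch and not part of the original answer: the function names list_unembedded_fonts and render_pdf_with_gs, the output file name used in the usage note, and the 150 dpi default are assumptions made for illustration. The column positions assume the pdffonts layout shown above.

```shell
#!/bin/sh
# Hypothetical helpers; names and defaults are illustrative assumptions.

# Read `pdffonts` output on stdin and print the name of every font whose
# "emb" column is "no". Fields are counted from the right because the
# "type" column can itself contain a space (for example "Type 1C").
# The NF guard skips the header separator and blank lines.
list_unembedded_fonts() {
    awk 'NF >= 8 && $(NF-4) == "no" { print $1 }'
}

# Rasterize a PDF with Ghostscript only, bypassing ImageMagick, so that a
# rendering problem can be attributed to Ghostscript itself.
# Usage: render_pdf_with_gs INPUT.pdf OUTPUT.jpg [DPI]
render_pdf_with_gs() {
    pdf=$1
    out=$2
    dpi=${3:-150}   # assumed default resolution
    gs -dBATCH -dNOPAUSE -dQUIET -sDEVICE=jpeg -r"$dpi" \
       -sOutputFile="$out" "$pdf"
}
```

For example, `pdffonts error-with-accents.pdf | list_unembedded_fonts` should print nothing for the file above, since every row shows yes in the emb column. Running `render_pdf_with_gs error-with-accents.pdf gs-only.jpg` then produces an image that reflects Ghostscript's rendering alone. If the accents are already missing there, changing the ImageMagick version is unlikely to help.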