
Converting PDF to image with PDFBox results in strange characters (Font issue)


I use PDFBox (2.x) to convert PDFs to images. The whole thing runs under Linux, and previously I had some trouble converting certain PDFs with non-embedded fonts. Then I added the base-14 fonts to the system and everything worked. So far so good.

Now a PDF came in that uses Courier-Bold, but the result is the following, although Courier-Bold is installed on the system. (It should be Roman letters, because my Russian is a bit rusty ;-):

[Image: screenshot of the rendered output]

So I am a bit puzzled why the PDF isn't converted correctly. The font in the PDF is defined as:

    1 0 obj <</Subtype/Type1/Type/Font/BaseFont/Courier-Bold/Encoding/WinAnsiEncoding>>

So why isn't PDFBox selecting the right font? No warning is shown while converting the PDF. The following fonts are installed:

  • Courier.ttf
  • CourierBold.ttf
  • CourierOblique.ttf
  • Courierboldoblique.ttf

I also installed the additional fonts mentioned in the comments (CourierNewPS-BoldMT, CourierNew-Bold, LiberationMono-Bold, NimbusMonL-Bold), but none of them worked. Every time I added a new font (to /.local/share/font), PDFBox reported that a new font was found, so the font itself is recognized. It must be something else.


Solution

  • The cause was the font file itself. For this base font, PDFBox currently looks for a font named "Courier-Bold" or, as substitutes, a font with one of these names:

    • CourierNewPS-BoldMT
    • CourierNew-Bold
    • LiberationMono-Bold
    • NimbusMonL-Bold

    Removing the installed "Courier-Bold" font and adding one of the fonts above solved the problem. The most probable explanation is that the installed "Courier-Bold" font file was broken.
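
    Why the substitutes only helped once "Courier-Bold" was removed can be illustrated with a small model of the lookup. This is a simplified sketch, not PDFBox's actual code: the class `CourierBoldFallback` and the method `resolve` are hypothetical names, and it assumes that an installed font with the exact name always wins and that substitutes are tried in the order listed above.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Optional;

// Hypothetical model of name-based font lookup; not part of PDFBox.
final class CourierBoldFallback {
    // Substitute names from the answer above; trying them in this order is an assumption.
    static final List<String> SUBSTITUTES = Arrays.asList(
            "CourierNewPS-BoldMT",
            "CourierNew-Bold",
            "LiberationMono-Bold",
            "NimbusMonL-Bold");

    // Returns the exact font name if it is installed, otherwise the first installed substitute.
    static Optional<String> resolve(String baseFont, Collection<String> installedNames) {
        if (installedNames.contains(baseFont)) {
            // An exact match is used even if the file behind it is broken.
            return Optional.of(baseFont);
        }
        return SUBSTITUTES.stream().filter(installedNames::contains).findFirst();
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("Courier-Bold", "LiberationMono-Bold");
        List<String> after = Arrays.asList("LiberationMono-Bold");
        System.out.println(resolve("Courier-Bold", before).orElse("none")); // Courier-Bold
        System.out.println(resolve("Courier-Bold", after).orElse("none"));  // LiberationMono-Bold
    }
}
```

    In this model, a substitute is never consulted while a font named "Courier-Bold" is present, which matches what was observed: adding the substitutes alone changed nothing, and removing the suspect font let one of them take over.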