PDFBox renderImageWithDPI produces images with missing content due to absent embedded fonts - how do I resolve this?


PDFBox renderImageWithDPI only partially renders text because of missing embedded(?) fonts.

  • Using PDFBox 2.0.28, then also tried PDFBox 3.0.0-RC1

  • Created a PDDocument using Loader.loadPDF

  • Created a PDFRenderer from the PDDocument

  • Executed renderImageWithDPI(pagenum, dpi, RGBObj) on the PDFRenderer

  • Obtained a java.awt.image.BufferedImage

  • Wrote it as a JPG using javax.imageio.ImageIO

  • However, there is missing content in the images

  • Extracted 2 sample problematic pages from the PDF using PDFsam Basic

  • Pg 1 which generates image 1

  • and Pg 2 which generated image 2

  • Have highlighted areas where the content is missing.

  • Executing PreflightParser.validate produces the messages below:

1.4 : Trailer Syntax error, /XRef cross reference streams are not allowed
5.2.2 : Forbidden field in an annotation definition, Flags of Link annotation are invalid
2.3.2 : Unexpected value for key in Graphic object definition, Unexpected 'true' value for 'Interpolate' Key
2.4.2 : Invalid Color space, The operator "k" can't be used with RGB Profile
2.4.3 : Invalid Color space, The operator "f" can't be used without Color Profile
3.1.4 : Invalid Font definition, ELWKFI+OptimaLTStd: The Charset entry is missing for the Type1 Subset
3.1.4 : Invalid Font definition, JECWGC+InsigniaLTStd: The Charset entry is missing for the Type1 Subset
3.1.4 : Invalid Font definition, PHSMMZ+OptimaLTStd-Bold: The Charset entry is missing for the Type1 Subset
3.1.4 : Invalid Font definition, EHCNNL+OptimaLTStd-Italic: The Charset entry is missing for the Type1 Subset
3.1.4 : Invalid Font definition, QBVSKF+HelveticaLTStd-Obl: The Charset entry is missing for the Type1 Subset
3.1.9 : Invalid Font definition, UBAPGG+OptimaLTStd: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, UBAPGG+OptimaLTStd: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, UBAPGG+OptimaLTStd: The FontFile can't be read
3.1.9 : Invalid Font definition, ORMCFE+HelveticaLTStd-Obl: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, ORMCFE+HelveticaLTStd-Obl: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, ORMCFE+HelveticaLTStd-Obl: The FontFile can't be read
3.1.9 : Invalid Font definition, TFEWKU+HelveticaLTStd-Roman: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, TFEWKU+HelveticaLTStd-Roman: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, TFEWKU+HelveticaLTStd-Roman: The FontFile can't be read
3.1.4 : Invalid Font definition, CRQQXS+OptimaLTStd: The Charset entry is missing for the Type1 Subset
3.1.4 : Invalid Font definition, MVVAWX+InsigniaLTStd: The Charset entry is missing for the Type1 Subset
3.1.4 : Invalid Font definition, YIWFBD+OptimaLTStd-Bold: The Charset entry is missing for the Type1 Subset
3.1.11 : Invalid Font definition, JYHLHF+OptimaLTStd: The CIDSet entry is missing for the Composite Subset
3.1.9 : Invalid Font definition, LDXBBC+OptimaLTStd: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, LDXBBC+OptimaLTStd: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, LDXBBC+OptimaLTStd: The FontFile can't be read
3.1.9 : Invalid Font definition, FSNSYC+OptimaLTStd: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, FSNSYC+OptimaLTStd: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, FSNSYC+OptimaLTStd: The FontFile can't be read
3.1.9 : Invalid Font definition, LVYKUL+InsigniaLTStd: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, LVYKUL+InsigniaLTStd: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, LVYKUL+InsigniaLTStd: The FontFile can't be read
3.1.9 : Invalid Font definition, FUYTUP+OptimaLTStd-Italic: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, FUYTUP+OptimaLTStd-Italic: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, FUYTUP+OptimaLTStd-Italic: The FontFile can't be read
3.1.9 : Invalid Font definition, GZVYQO+OptimaLTStd-Bold: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, GZVYQO+OptimaLTStd-Bold: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, GZVYQO+OptimaLTStd-Bold: The FontFile can't be read
3.1.9 : Invalid Font definition, GWNIWZ+HelveticaLTStd-Roman: mandatory CIDToGIDMap missing
3.1.11 : Invalid Font definition, GWNIWZ+HelveticaLTStd-Roman: The CIDSet entry is missing for the Composite Subset
3.2.3 : Font damaged, GWNIWZ+HelveticaLTStd-Roman: The FontFile can't be read
7.1 : Error on MetaData, Metadata is not a stream

The Preflight errors are corroborated by warnings logged during rendering:

May 26, 2023 12:40:01 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font GWNIWZ+HelveticaLTStd-Roman
java.io.IOException: head is mandatory
    at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:182)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
    at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
    at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:114)
    at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:67)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:138)
    at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:88)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:96)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:143)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:66)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:849)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:495)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:469)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:142)
    at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:264)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:338)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:259)
    at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:245)

Additional truncated messages

May 26, 2023 12:40:00 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font UBAPGG+OptimaLTStd
java.io.IOException: head is mandatory

May 26, 2023 12:40:01 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font GZVYQO+OptimaLTStd-Bold
java.io.IOException: head is mandatory

May 26, 2023 12:40:01 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font FUYTUP+OptimaLTStd-Italic
java.io.IOException: head is mandatory

May 26, 2023 12:40:01 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
WARNING: Could not read embedded OTF for font FSNSYC+OptimaLTStd
java.io.IOException: head is mandatory

Although fallback fonts seem to be used, they don't work either.

May 26, 2023 12:40:01 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 findFontOrSubstitute
WARNING: Using fallback font LiberationSans for CID-keyed TrueType font GWNIWZ+HelveticaLTStd-Roman

I also see warning messages such as the one below and am unsure how to address them.

May 26, 2023 12:40:01 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
WARNING: ICC profile is Perceptual, ignoring, treating as Display class
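
To see at a glance which fonts the Preflight messages above flag, and under which error codes, the font-related lines can be grouped by code. This is only a minimal sketch using the standard library: it assumes the messages are available as plain strings in the "code : category, FONTNAME: detail" shape shown above, and the class and method names (PreflightSummary, fontsByCode) are hypothetical, not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreflightSummary {

    // Matches font-related lines such as
    // "3.2.3 : Font damaged, UBAPGG+OptimaLTStd: The FontFile can't be read".
    // Group 1 is the error code, group 2 the subset font name (six-letter tag + name).
    private static final Pattern FONT_LINE =
            Pattern.compile("^(\\d+(?:\\.\\d+)*) : [^,]+, ([A-Z]{6}\\+[^:]+): .*$");

    // Hypothetical helper: maps each error code to the set of fonts reported under it.
    // Lines that do not mention a subset font are ignored.
    static Map<String, Set<String>> fontsByCode(List<String> messages) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String message : messages) {
            Matcher m = FONT_LINE.matcher(message.trim());
            if (m.matches()) {
                result.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A few lines copied from the validation output above.
        List<String> messages = Arrays.asList(
                "1.4 : Trailer Syntax error, /XRef cross reference streams are not allowed",
                "3.1.4 : Invalid Font definition, ELWKFI+OptimaLTStd: The Charset entry is missing for the Type1 Subset",
                "3.2.3 : Font damaged, UBAPGG+OptimaLTStd: The FontFile can't be read",
                "3.2.3 : Font damaged, GWNIWZ+HelveticaLTStd-Roman: The FontFile can't be read");
        fontsByCode(messages).forEach((code, fonts) -> System.out.println(code + " -> " + fonts));
    }
}
```

Grouping this way makes it easier to compare the fonts listed under 3.2.3 (FontFile can't be read) with the fonts named in the "Could not read embedded OTF" warnings.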

I need help with two questions.

Question 1: How do I add a font?

  • The code below, which gets a page and adds a font to its resources before rendering, has no effect.
  • Note: getDocument(), setDocument() and setPdfRenderer() are convenience methods in my implementation class. setPdfRenderer() runs PDFRenderer renderer = new PDFRenderer(document); and stores the result in a class variable.
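import java.io.File;
import org.apache.fontbox.ttf.OTFParser;
import org.apache.fontbox.ttf.OpenTypeFont;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType0Font;

// Load the downloaded OTF and register it in the page's resources.
int position = 0;
PDPage page = getDocument().getPage(position);
PDResources resources = page.getResources();
OTFParser otfParser = new OTFParser();
OpenTypeFont otf = otfParser.parse(new File("OptimaLTStd.otf"));
PDFont font = PDType0Font.load(document, otf, false);

resources.add(font);
page.setResources(resources);

// Re-insert the modified page into the page tree.
if (position == 0) {
    getDocument().getPages().remove(page);
    getDocument().getPages().add(page);
} else {
    PDPage prevPage = getDocument().getPage(position - 1);
    getDocument().getPages().insertBefore(page, prevPage);
}

// Rebuild the renderer on the updated document (done in both branches originally).
setDocument(getDocument());
setPdfRenderer(getDocument());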
  • Downloaded OTF from link

Question 2: Is there an override in PDFRenderer to skip glyph processing, so that font-related issues do not affect image generation?


Solution

  • The missing text is caused by zero-width definitions for the fonts in the PDF, which incorrectly influence a "stretching" algorithm when rendering. This has been fixed in ticket PDFBOX-5611, and the fix will be in version 2.0.29. Until then, a snapshot build will be available.
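
To illustrate why a zero width can make text vanish: a renderer that draws with a substitute font may stretch each glyph horizontally so that it matches the width declared in the PDF. The sketch below is an assumption about what such a stretching step looks like in general, not PDFBox's actual code. The class and method names are hypothetical, and the glyph outline is simplified to a rectangle.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

public class GlyphStretchSketch {

    // Hypothetical helper: scales a glyph outline horizontally by
    // (width declared in the PDF) / (width of the substitute font's glyph).
    // With pdfWidth == 0 the scale factor is 0 and the glyph collapses.
    static Rectangle2D stretchNaive(Rectangle2D glyph, float pdfWidth, float fontWidth) {
        AffineTransform at = new AffineTransform();
        at.scale(pdfWidth / fontWidth, 1);
        return at.createTransformedShape(glyph).getBounds2D();
    }

    // Same idea, but the stretch is skipped when either width is not positive,
    // so a zero-width declaration leaves the glyph at its natural size.
    static Rectangle2D stretchGuarded(Rectangle2D glyph, float pdfWidth, float fontWidth) {
        if (pdfWidth <= 0 || fontWidth <= 0) {
            return glyph.getBounds2D();
        }
        return stretchNaive(glyph, pdfWidth, fontWidth);
    }

    public static void main(String[] args) {
        // Stand-in glyph outline: 500 units wide, 700 units tall.
        Rectangle2D glyph = new Rectangle2D.Float(0, 0, 500, 700);

        System.out.println("naive width:   " + stretchNaive(glyph, 0f, 500f).getWidth());
        System.out.println("guarded width: " + stretchGuarded(glyph, 0f, 500f).getWidth());
    }
}
```

In the naive variant a declared width of zero squeezes the glyph to nothing, which would match the symptom of text missing from the rendered image. Whether the actual fix guards the scaling in this way is not stated above, so treat the guarded variant as one plausible approach only.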