Problem with PDFTextStripper().getText when using PDType0Font in pdfbox


I recently started working with PDType0Font (we had been using PDType1Font.HELVETICA but needed Unicode support), and I'm running into a problem: I add lines to the file using PDPageContentStream, but PDFTextStripper.getText doesn't return the updated file contents.

I'm loading the font:

PDType0Font.load(document, fontFile)

And creating the contentStream as follows:

PDPageContentStream(document, pdPage, PDPageContentStream.AppendMode.PREPEND, false)

My function that adds content to the PDF is:

  private fun addTextToContents(contentStream: PDPageContentStream, txtLines: List<String>, x: Float, y: Float, pdfFont: PDFont, fontSize: Float, maxWidth: Float) {
     contentStream.beginText()
     contentStream.setFont(pdfFont, fontSize)
     contentStream.newLineAtOffset(x, y)
     txtLines.forEach { txt ->
       contentStream.showText(txt)
       // Move down one line; the offset is relative to the previous line start
       contentStream.newLineAtOffset(0.0F, -fontSize)
     }
     contentStream.endText()
     contentStream.close()
  }

When I try to read the content of the file using PDFTextStripper.getText, I get the text as it was before the changes. However, if I call document.save before passing the document to PDFTextStripper, it works.

      val txt: String = PDFTextStripper().getText(doc) //not working

      doc.save(//File)
      val txt: String = PDFTextStripper().getText(doc) //working

If I use PDType1Font.HELVETICA in

contentStream.setFont(pdfFont, fontSize)

Everything is working without any problems and without saving the doc before reading the text.

I suspect that the issue lies in this code in PDPageContentStream.showTextInternal():

    // Unicode code points to keep when subsetting
    if (font.willBeSubset())
    {
        int offset = 0;
        while (offset < text.length())
        {
            int codePoint = text.codePointAt(offset);
            font.addToSubset(codePoint);
            offset += Character.charCount(codePoint);
        }
    }

This is the only difference I can find between using PDType0Font with embedded subsets and using PDType1Font.

Can someone help with this? What am I doing wrong?


Solution

  • Your question, in particular the quoted code, already hints at the answer:

    When using a font that will be subset (font.willBeSubset() == true), the associated PDF objects are unfinished until the file is saved. Text extraction on the other hand needs the finished PDF objects to properly work. Thus, don't apply text extraction to a document that is still being created and uses fonts that will be subset.

    You describe your use case as

    "for our unit tests, we are adding text (mandatory text for us) to the document and then using PDFTextStripper we are validating that the file has the proper fields."

    As Tilman proposes: "Then it would make more sense to save the PDF, and then to reload. That would be a more realistic test. Not saving is cutting corners IMHO."

    Indeed, in unit tests you should first produce the final PDF as it will be sent out (i.e. saving it, either to the file system or to memory), then reload that file, and test only this reloaded document.
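
The save-then-reload approach described above could look roughly like the following in a test helper. This is only a sketch: it assumes PDFBox 2.x (where PDDocument.load(ByteArray) is available; PDFBox 3.x loads documents through a different entry point), and the helper name extractTextAfterRoundTrip is made up for illustration. It saves to memory rather than to the file system, so the test leaves no files behind.

```kotlin
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import java.io.ByteArrayOutputStream

// Hypothetical helper (not part of PDFBox): saves the finished document to
// memory, reloads it, and extracts the text from the reloaded copy only.
fun extractTextAfterRoundTrip(doc: PDDocument): String {
    val buffer = ByteArrayOutputStream()
    // Saving is the step that completes the objects of fonts that will be subset.
    doc.save(buffer)
    // Assumes PDFBox 2.x: PDDocument.load accepts a byte array.
    return PDDocument.load(buffer.toByteArray()).use { reloaded ->
        PDFTextStripper().getText(reloaded)
    }
}
```

Call it once the document is complete, for example `val txt = extractTextAfterRoundTrip(doc)`, and run the assertions against `txt` instead of calling PDFTextStripper on the document that is still being built.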