Getting OutOfMemoryError with PDFBox Annotation constructAppearances() method


In a Nutshell

I've been working on a program that takes a PDF, highlights some words (via PDFBox text markup annotations), and saves the new PDF.

For these annotations to be visible in some viewers, such as pdf.js, I need to call pdAnnotationTextMarkup.constructAppearances() before adding the annotation to the page's annotation list.

However, by doing so, I get an OutOfMemoryError when dealing with huge documents that would contain thousands of mark annotations.

I'd like to know if there's a way to prevent this from happening.

(This is something of a sequel to this ticket, though that isn't very relevant here.)

Technical Specification:

PDFBox 2.0.17
Java 11.0.6+10, AdoptOpenJDK
MacOS Catalina 10.15.2, 16gb, x86_64

My Code

//my pdf has 216 pages     
for (int pageIndex = 0; pageIndex < numberOfPages; pageIndex++) {
    PDPage page = document.getPage(pageIndex);
    List<PDAnnotation> annotations = page.getAnnotations();

    // each coordinate object represents one highlight annotation; crashes with 7,816 elements
    for (CoordinatePoint coordinate : coordinates) {
        PDAnnotationTextMarkup txtMark = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
        txtMark.setRectangle(pdRectangle);
        txtMark.setQuadPoints(quadPoints);
        txtMark.setColor(getColor());
        txtMark.setTitlePopup(coordinate.getHintDescription());
        txtMark.setReadOnly(true);

        // this is what makes everything visible on pdf.js and what causes the Java heap space error
        txtMark.constructAppearances();

        annotations.add(txtMark);
    }
}

Current Result

This is the heavy pdf doc that is leading to the issue: https://pdfhost.io/v/I~nu~.6G_French_Intensive_Care_Society_International_congress_Ranimation_2016.pdf

My program tries to add 7,816 annotations to it across 216 pages.

And the stack trace:

[main] INFO highlight.PDFAnnotation - Highlighting 13613_2016_Article_114.pdf...
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.pdfbox.io.ScratchFile.<init>(ScratchFile.java:128)
    at org.apache.pdfbox.io.ScratchFile.getMainMemoryOnlyInstance(ScratchFile.java:143)
    at org.apache.pdfbox.cos.COSStream.<init>(COSStream.java:61)
    at org.apache.pdfbox.pdmodel.interactive.annotation.handlers.PDAbstractAppearanceHandler.createCOSStream(PDAbstractAppearanceHandler.java:106)
    at org.apache.pdfbox.pdmodel.interactive.annotation.handlers.PDHighlightAppearanceHandler.generateNormalAppearance(PDHighlightAppearanceHandler.java:136)
    at org.apache.pdfbox.pdmodel.interactive.annotation.handlers.PDHighlightAppearanceHandler.generateAppearanceStreams(PDHighlightAppearanceHandler.java:59)
    at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationTextMarkup.constructAppearances(PDAnnotationTextMarkup.java:175)
    at org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationTextMarkup.constructAppearances(PDAnnotationTextMarkup.java:147)
    at highlight.PDFAnnotation.drawHLAnnotations(PDFAnnotation.java:288)

I've already tried increasing the JVM heap settings to -Xmx10g -Xms10g, which only postponed the crash a little.

What I Want

I want to prevent this memory issue and still be able to see my annotations in the pdf.js viewer. Without calling constructAppearances() the process is much faster and the issue doesn't occur, but the annotations are then visible only in some PDF viewers, such as Adobe.

Any suggestions? Am I doing anything wrong here or missing something?


Solution

  • In the upcoming version 2.0.19, construct the appearances like this:

    annotation.constructAppearances(document);
    

    In 2.0.18 and earlier, you need to set the appearance handler yourself before calling constructAppearances():

    annotation.setCustomAppearanceHandler(new PDHighlightAppearanceHandler(annotation, document));
    

    That line can be removed in 2.0.19 as this is the default appearance handler.

    Why all this? So that the document's common memory space (the "scratch file") is used in the appearance handler instead of creating a new one each time (which is big). The latter happens when calling new COSStream() instead of document.getDocument().createCOSStream().

    All this is of course only important when creating many annotations.

    Related PDFBox issues: PDFBOX-4772 and PDFBOX-4080
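To tie this back to the loop in the question, here is a minimal sketch of the 2.0.18-and-earlier variant, assuming PDFBox 2.0.x is on the classpath. The class name SharedScratchHighlighter, the helpers addHighlight and toQuadPoints, the yellow color, and the placeholder rectangle in main are all made up for illustration; in the real program the rectangle and quad points would come from the coordinate objects.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.graphics.color.PDColor;
import org.apache.pdfbox.pdmodel.graphics.color.PDDeviceRGB;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationTextMarkup;
import org.apache.pdfbox.pdmodel.interactive.annotation.handlers.PDHighlightAppearanceHandler;

// Hypothetical helper class, not part of PDFBox.
public class SharedScratchHighlighter {

    // Quad points in the order used by the PDFBox annotation examples:
    // upper-left, upper-right, lower-left, lower-right.
    static float[] toQuadPoints(PDRectangle rect) {
        return new float[] {
            rect.getLowerLeftX(),  rect.getUpperRightY(),
            rect.getUpperRightX(), rect.getUpperRightY(),
            rect.getLowerLeftX(),  rect.getLowerLeftY(),
            rect.getUpperRightX(), rect.getLowerLeftY()
        };
    }

    // Adds one highlight to the page. The appearance stream is created through
    // a handler that knows the document, so it uses the document's scratch file.
    static PDAnnotationTextMarkup addHighlight(PDDocument document, PDPage page, PDRectangle rect)
            throws IOException {
        PDAnnotationTextMarkup txtMark =
                new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);
        txtMark.setRectangle(rect);
        txtMark.setQuadPoints(toQuadPoints(rect));
        txtMark.setColor(new PDColor(new float[] {1f, 1f, 0f}, PDDeviceRGB.INSTANCE));
        txtMark.setReadOnly(true);

        // 2.0.18 and earlier: pass the document to the handler explicitly.
        // From 2.0.19 on, txtMark.constructAppearances(document) replaces both lines.
        txtMark.setCustomAppearanceHandler(new PDHighlightAppearanceHandler(txtMark, document));
        txtMark.constructAppearances();

        List<PDAnnotation> annotations = page.getAnnotations();
        annotations.add(txtMark);
        return txtMark;
    }

    // Usage: java SharedScratchHighlighter input.pdf output.pdf
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            for (PDPage page : document.getPages()) {
                // Placeholder rectangle; real coordinates come from the text positions.
                addHighlight(document, page, new PDRectangle(72, 700, 200, 12));
            }
            document.save(args[1]);
        }
    }
}
```

The only change relative to the code in the question is the setCustomAppearanceHandler(...) line before constructAppearances(); everything else stays as it was.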