Search code examples
javapdfitextecmredaction

getting exception while redacting pdf using itext


I am getting below exception while trying to redact pdf document using itext. The issue is very sporadic like sometime it is working and sometimes it is throwing error.

at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.access$6100(PdfContentStreamProcessor.java:60)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor$Do.invoke(PdfContentStreamProcessor.java:991)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpContentOperator.invoke(PdfCleanUpContentOperator.java:140)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.invokeOperator(PdfContentStreamProcessor.java:286)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:425)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUpPage(PdfCleanUpProcessor.java:160)
at com.itextpdf.text.pdf.pdfcleanup.PdfCleanUpProcessor.cleanUp(PdfCleanUpProcessor.java:135)
at RedactionClass.tgestRedactJavishsInput(RedactionClass.java:56)
at RedactionClass.main(RedactionClass.java:23)

Code which i am using to redact is below:

public static void testRedact() throws IOException, DocumentException {

    InputStream resource = new FileInputStream("D:/itext/edited_120192824_5 (1).pdf");
    OutputStream result = new FileOutputStream(new File(OUTPUTDIR,
            "aviteshs.pdf"));

    PdfReader reader = new PdfReader(resource);
    PdfStamper stamper = new PdfStamper(reader, result);
    int pageCount = reader.getNumberOfPages();
    Rectangle linkLocation1 = new Rectangle(440f, 700f, 470f, 710f);
    Rectangle linkLocation2 = new Rectangle(308f, 205f, 338f, 215f);
    Rectangle linkLocation3 = new Rectangle(90f, 155f, 130f, 165f);
    List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
    for (int currentPage = 1; currentPage <= pageCount; currentPage++) {
        if (currentPage == 1) {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation1, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation2, BaseColor.BLACK));
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation3, BaseColor.BLACK));
        } else {
            cleanUpLocations.add(new PdfCleanUpLocation(currentPage,
                    linkLocation1, BaseColor.BLACK));
        }
    }
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations,
            stamper);
    try {
        cleaner.cleanUp();
    } catch (Exception e) {
        e.printStackTrace();
    }
    stamper.close();
    reader.close();

}

Due to customer document i am unable to share it , trying to find out some test data for same.

Please find the doc here:

https://drive.google.com/file/d/0B-zalNTEeIOwM1JJVWctcW8ydU0/view?usp=drivesdk


Solution

  • In short: The cause of the NullPointerException here is that iText does not support form XObject resource inheritance from the page they are displayed on. According to the PDF specification this construct is obsolete but it can be encountered in PDFs obeying early PDF references instead of the specification.

    The cause

    Page 1 of the document in question contains 4 XObject resources named I1, M0, P1, and Q0:

    RUPS screenshot

    As you can see in the screenshot, Q0 in particular has no own Resources dictionary. But its last instructions are

    q
    413 0 0 125 75 3086 cm
    /I1 Do
    Q
    

    Id est it references a resource I1.

    Now iText in case of form XObjects assumes that the resources their contents reference are contained in their own Resources dictionary.

    The result: iText accesses a null dictionary and a NullPointerException occurs.

    The specification

    The PDF specification ISO 32000-1 specifies:

    A resource dictionary shall be associated with a content stream in one of the following ways:

    • For a content stream that is the value of a page’s Contents entry (or is an element of an array that is the value of that entry), the resource dictionary shall be designated by the page dictionary’s Resources or is inherited, as described under 7.7.3.4, "Inheritance of Page Attributes," from some ancestor node of the page object.

    • For other content streams, a conforming writer shall include a Resources entry in the stream's dictionary specifying the resource dictionary which contains all the resources used by that content stream. This shall apply to content streams that define form XObjects, patterns, Type 3 fonts, and annotation.

    • PDF files written obeying earlier versions of PDF may have omitted the Resources entry in all form XObjects and Type 3 fonts used on a page. All resources that are referenced from those forms and fonts shall be inherited from the resource dictionary of the page on which they are used. This construct is obsolete and should not be used by conforming writers.

    (ISO 32000-1, section 7.8.3 - Resource Dictionaries)

    Thus, in the case at hand we are in the situation of the obsolete option three, Q0 references the XObject I1 defined in the resource dictionary of the page Q0 is used for.

    The document in question has a version header claiming PDF 1.5 conformance (in contrast to PDF 1.7 of the PDF specification). So let's look at the PDF Reference 1.5. The paragraph there corresponding to option three is:

    • A form XObject or a Type 3 font’s glyph description may omit the Resources entry, in which case resources will be looked up in the Resources entry of the page on which the form or font is used. This practice is not recommended.

    Summarized, therefore, the PDF in question uses a construct which the PDF specification (published in 2008, in use for nine years!) calls obsolete and even the PDF Reference the file claims conformance to recommends against. iText, on the other hand, does not support this obsolete construct.

    Ideas how to fix this

    Essentially the PDF Cleanup code must be extended to

    • remember the resources of the current page in the PdfCleanUpProcessor and
    • use these current page resources in the PdfCleanUpContentOperator method invoke in case of a Do operator referring to form XObject without own resources.

    Unfortunately some members used in invoke are private. Thus, one has to either copy the PdfCleanUp code or fall back on reflection.

    (iText 5.5.12-SNAPSHOT)

    iText 7

    The iText 7 PDF CleanUp tool also runs into an issue for your PDF, here the exception is a IllegalStateException claiming "Graphics state is always deleted after event dispatching. If you want to preserve it in renderer info, use preserveGraphicsState method after receiving renderer info."

    As this exception is thrown during event dispatching, this error message does not make sense. Unfortunately the PDF CleanUp tool has become closed source in iText 7, so it is not so easy pinpointing the issue.

    (iText 7.0.3-SNAPSHOT; PDF CleanUp 1.0.2-SNAPSHOT)