PDFBox org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSDictionary


Using PDFBox 2.0.25, I load a document and try to read its signature dictionaries (example PDF):

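// inputFile is the PDF to inspect; PDDocument is org.apache.pdfbox.pdmodel.PDDocument
try (PDDocument doc = PDDocument.load(inputFile)) {
    doc.getSignatureDictionaries();
} catch (Exception e) {
    e.printStackTrace();
}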

The document was generated from a scan. Its producer is:

Foxit PhantomPDF Printer Version 6.1.0.0923

The call PDDocument.load(inputFile) logs this warning:

Object (140:0) at offset 4039608 does not end with 'endobj' but with '0'

The call doc.getSignatureDictionaries() then throws:

java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSDictionary
        at org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.getFields(PDAcroForm.java:378)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldTree$FieldIterator.<init>(PDFieldTree.java:79)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldTree$FieldIterator.<init>(PDFieldTree.java:68)
        at org.apache.pdfbox.pdmodel.interactive.form.PDFieldTree.iterator(PDFieldTree.java:62)
        at org.apache.pdfbox.pdmodel.PDDocument.getSignatureFields(PDDocument.java:932)
        at org.apache.pdfbox.pdmodel.PDDocument.getSignatureDictionaries(PDDocument.java:952)

Why is this happening? Can a file like this be handled?

Update: I replaced the Maven dependency Apache PDFBox 2.0.25 with Apache PDFBox 2.0.26 and still get the same error.


Solution

  • The underlying issue is that there is an error in the object stream in the PDF.

    According to the PDF specification ISO 32000 (both part 1 and 2), section 7.5.7 – Object Streams –

    An object in an object stream shall not consist solely of an object reference.

    But the example document shared by @blinkbink does contain such object stream entries: in particular, 113 0 R for object 140, 141 0 R for object 157, and 179 0 R for object 191.

    As these object references are forbidden in object streams, many PDF processors parse them as the only other type of object that starts with an integer, namely a number object. For example, object 140 is parsed as the number 113, not as a reference to object 113 (which happens to be a form field object).
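
    The fallback described above can be illustrated with a toy parser. This is a minimal sketch in plain Java, not PDFBox's actual parsing code: the class ObjectStreamEntrySketch, the method parseEntry and the referencesAllowed flag are hypothetical names made up for this illustration, and the result is returned as a plain string for readability.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy model of reading one object stream entry. Hypothetical; not PDFBox code. */
final class ObjectStreamEntrySketch {

    // "<object number> <generation> R", e.g. "113 0 R"
    private static final Pattern REFERENCE =
            Pattern.compile("^\\s*(\\d+)\\s+(\\d+)\\s+R\\b");

    // Any entry that begins with an integer, e.g. "113" or "113 0 R"
    private static final Pattern LEADING_INTEGER =
            Pattern.compile("^\\s*([+-]?\\d+)");

    /**
     * Classifies the raw text of an entry.
     *
     * @param referencesAllowed false models a parser that follows the rule
     *        that an object stream entry must not be a bare reference
     */
    static String parseEntry(String raw, boolean referencesAllowed) {
        Matcher reference = REFERENCE.matcher(raw);
        if (referencesAllowed && reference.find()) {
            return "reference " + reference.group(1) + " " + reference.group(2);
        }
        // A reference is not an option here, so the only remaining object
        // type that starts with an integer is a number.
        Matcher number = LEADING_INTEGER.matcher(raw);
        if (number.find()) {
            return "integer " + number.group(1);
        }
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(parseEntry("113 0 R", true));
        System.out.println(parseEntry("113 0 R", false));
    }
}
```

    In this sketch, parseEntry("113 0 R", false) returns "integer 113", while the same text with referencesAllowed set to true returns "reference 113 0". The first case mirrors how object 140 ends up as the number 113.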

    As a consequence, these PDF processors find number objects in an array of the example document that should hold only form field objects. If a processor's form field reading code is not written defensively, the result is something like the ClassCastException observed here.

    Thus, while PDFBox used to not be defensively programmed here, the main issue is in the PDF producer that created the PDF at hand. An issue should be filed with them.
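
    The difference between a blind cast and a defensive read can be sketched in plain Java. This is an illustration under stated assumptions, not PDFBox's implementation: a java.util.Map stands in for a COSDictionary, an Integer for a COSInteger, and the class and method names (DefensiveFieldReadSketch, readFieldsUnchecked, readFieldsDefensively) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of reading a fields array. Hypothetical; not PDFBox code. */
final class DefensiveFieldReadSketch {

    /** Casts every entry blindly; fails on the first entry that is not a dictionary. */
    @SuppressWarnings("unchecked")
    static List<Map<String, Object>> readFieldsUnchecked(List<Object> fieldsArray) {
        List<Map<String, Object>> fields = new ArrayList<>();
        for (Object entry : fieldsArray) {
            // Throws ClassCastException if entry is, for example, an Integer.
            fields.add((Map<String, Object>) entry);
        }
        return fields;
    }

    /** Checks the type first and skips entries that are not dictionaries. */
    static List<Map<String, Object>> readFieldsDefensively(List<Object> fieldsArray) {
        List<Map<String, Object>> fields = new ArrayList<>();
        for (Object entry : fieldsArray) {
            if (entry instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> dictionary = (Map<String, Object>) entry;
                fields.add(dictionary);
            }
            // Anything else (such as a stray number) is ignored.
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Object> signatureField = new HashMap<>();
        signatureField.put("FT", "Sig");
        // One proper field dictionary plus a number where a field should be.
        List<Object> fieldsArray = Arrays.asList(signatureField, 113);

        System.out.println(readFieldsDefensively(fieldsArray).size());
        try {
            readFieldsUnchecked(fieldsArray);
        } catch (ClassCastException e) {
            System.out.println("blind cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

    Note that skipping the stray number avoids the exception but also silently drops the field that the broken entry was meant to point to, so a defensive reader makes the file loadable without repairing it.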