XFA extraction works in PDFBox 1.8.12 but not in 2.0.4

I have been extracting the XFA from a file, and it worked fine until I upgraded PDFBox from 1.8.12 to 2.0.4.

I have a file that I can extract XFA from using 1.8.12 but not using 2.0.4.

When I extract it using PDFBox 2.0.4, I get the structure of the XFA but almost all the values are missing. When I extract the same form using 1.8.12, on the other hand, it comes out fine.

I looked into a similar issue on SO. It's said to be fixed in 2.0.4 but I am still facing issues.

Any ideas?

I have included the files

Generated XFA-1.8.12

Generated XFA-2.0.4

File used

EDIT#1

For 2.0.4

    // returns the raw bytes of the XFA resource, or null if it cannot be read
    public static byte[] getParsableXFAForm(File file) {
        if (file == null)
            return null;
        PDDocument doc;
        PDDocumentCatalog catalog;
        PDAcroForm acroForm;

        PDXFAResource xfa;
        try {
            // String pass = null;
            doc = PDDocument.load(file);
            if (doc == null)
                return null;
            // flattenPDF(doc);
            doc.setAllSecurityToBeRemoved(true);
            // System.out.println("Security " + doc.isAllSecurityToBeRemoved());
            catalog = doc.getDocumentCatalog();
            if (catalog == null) {
                doc.close();
                return null;
            }
            acroForm = catalog.getAcroForm();
            if (acroForm == null) {
                doc.close();
                return null;
            }
            xfa = acroForm.getXFA();
            if (xfa == null) {
                doc.close();
                return null;
            }
            // TODO return byte[]
            byte[] xfaBytes = xfa.getBytes();
            doc.close();
            return xfaBytes;
        } catch (IOException e) {
            // handle IOException
            // happens when the file is corrupt.
            e.printStackTrace();
            System.out.println("XFAUtils-getParsableXFAForm-IOException");
            return null;
        }
    }

For 1.8.12

    public static byte[] getParsableXFAForm(File file) {
        if (file == null)
            return null;
        PDDocument doc;
        PDDocumentCatalog catalog;
        PDAcroForm acroForm;
        PDXFA xfa;
        try {
            doc = PDDocument.loadNonSeq(file, null);
            if (doc == null)
                return null;
            // flattenPDF(doc);
            doc.setAllSecurityToBeRemoved(true);
            // System.out.println("Security " + doc.isAllSecurityToBeRemoved());
            catalog = doc.getDocumentCatalog();
            if (catalog == null) {
                doc.close();
                return null;
            }
            acroForm = catalog.getAcroForm();

            if (acroForm == null) {
                doc.close();
                return null;
            }
            xfa = acroForm.getXFA();
            if (xfa == null) {
                doc.close();
                return null;
            }
            // TODO return byte[]
            byte[] xfaBytes = xfa.getBytes();
            doc.close();
            return xfaBytes;
        } catch (IOException e) {
            // handle IOException
            // happens when the file is corrupt.
            // e.printStackTrace();
            System.out.println("XFAUtils-getParsableXFAForm-IOException");
            return null;
        }
    }

Solution

At first glance

Your PDF contains 6 revisions, over the course of which the XFA form has been filled in more and more. Your 1.8.12 code extracts the most current version of the XFA form while your 2.0.4 code extracts the oldest version of it.
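As a rough way to see how many revisions a file holds without any PDF library, one can count the `%%EOF` markers, since each incremental update normally appends its own trailer ending in `%%EOF`. This is only a heuristic sketch: the class and method names are made up for illustration, linearized files carry an extra marker, and the byte sequence could in principle also occur inside a stream.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class PdfRevisionCounter {
    private static final String EOF_MARKER = "%%EOF";

    /** Counts "%%EOF" markers; a heuristic for the number of revisions. */
    static int countEofMarkers(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so indices equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int count = 0;
        int pos = text.indexOf(EOF_MARKER);
        while (pos >= 0) {
            count++;
            pos = text.indexOf(EOF_MARKER, pos + EOF_MARKER.length());
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PdfRevisionCounter <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("%%EOF markers found: " + countEofMarkers(data));
    }
}
```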

I ran your 2.0.4 code with PDFBox versions 2.0.4, 2.0.5, and the current development snapshot 2.1.0-SNAPSHOT. With version 2.0.4 I could indeed reproduce that the oldest revision of the XFA form was loaded, but with 2.0.5 or 2.1.0-SNAPSHOT the current revision was loaded.

This appears to be a shortcoming in PDFBox 2.0.0...2.0.4 which has been fixed in 2.0.5.

On closer examination

Since a bug that makes PDFBox 2.0.4 read the XFA form from the wrong revision of the file seemed quite implausible, I looked into this some more.

In particular I had a closer look at the PDF file itself. And indeed, it turned out that the file has 10 trash bytes before the actual PDF file header!

These additional trash bytes make all cross references and offsets relative to the file start wrong. Thus, PDFBox cannot parse the file in the regular manner and has to attempt a repair instead.
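To check a file for this kind of damage, it is enough to look for the position of the `%PDF-` header marker: anything other than 0 means there are bytes in front of the header. The following is a minimal sketch using only the JDK; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
final class PdfHeaderOffset {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns the index of the first "%PDF-" marker, or -1 if there is none. */
    static int find(byte[] data) {
        for (int i = 0; i + MARKER.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (data[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PdfHeaderOffset <file.pdf>");
            return;
        }
        int offset = find(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(offset < 0
                ? "no %PDF- header found"
                : "header starts at byte " + offset);
    }
}
```

For a file damaged like the one in the question, this should report an offset of 10 instead of 0.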

Looking at the differences between 2.0.4 and 2.0.5, there have in particular been substantial changes in the code that repairs PDFs with broken cross references and offsets. So while PDFBox 2.0.4 could only partially repair the file (finding only the initial XFA revision), PDFBox 2.0.5 succeeded in a more complete repair and in particular found the newest XFA revision.

Having fixed the OP's PDF (i.e. having removed the leading trash bytes, cf. XFA-File-fixed.pdf), I could successfully extract the current XFA form revision using PDFBox versions 2.0.0...2.0.4, too.
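Removing the leading bytes can be done with a few lines of plain Java. This is a sketch with hypothetical names, and it only yields a valid file if the offsets inside the PDF were written relative to the `%PDF-` header, which is what I found for this particular file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class PdfLeadingBytesStripper {

    /** Returns the data starting at the first "%PDF-" marker; unchanged if the marker is absent or already at 0. */
    static byte[] strip(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so indices equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int start = text.indexOf("%PDF-");
        return start <= 0 ? data : Arrays.copyOfRange(data, start, data.length);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PdfLeadingBytesStripper <in.pdf> <out.pdf>");
            return;
        }
        byte[] fixed = strip(Files.readAllBytes(Paths.get(args[0])));
        Files.write(Paths.get(args[1]), fixed);
    }
}
```

Writing to a separate output file keeps the original untouched in case the assumption about the offsets does not hold.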

Thus, this is not a PDFBox bug, as I initially assumed, but merely a broken PDF file that the PDFBox repair functionality could not fully fix before the improvements in 2.0.5.