I am using pdfbox-1.8.12 to read the XFA content from PDF files. For most files I get the XFA successfully, without any missing field values.
The trouble is with some files, such as error.pdf. Many of its fields, such as CIN, have no value in the extracted XFA, but when I open the file in a PDF viewer such as Foxit or Acrobat, the field value is shown.
    public static byte[] getParsableXFAForm(File file) {
        if (file == null)
            return null;
        PDDocument doc;
        PDDocumentCatalog catalog;
        PDAcroForm acroForm;
        PDXFA xfa;
        try {
            doc = PDDocument.load(file);
            catalog = doc.getDocumentCatalog();
            acroForm = catalog.getAcroForm();
            xfa = acroForm.getXFA();
            byte[] xfaBytes = xfa.getBytes();
            doc.close();
            return xfaBytes;
        } catch (IOException e) {
            // handle IOException
            // happens when the file is corrupt.
            System.out.println("IOException");
            return null;
        }
    }
Then the byte[] is converted to String.
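For reference, the conversion and search step can be sketched as below. This is only an illustration: it assumes the XFA bytes are UTF-8 encoded XML, and the class name XfaSearch, the helper xfaContains and the sample `<CIN>` element are made up for the example, not taken from the real form.

```java
import java.nio.charset.StandardCharsets;

class XfaSearch {
    /**
     * Decodes the XFA bytes as UTF-8 (an assumption about the encoding)
     * and reports whether the given value occurs anywhere in the XML text.
     */
    public static boolean xfaContains(byte[] xfaBytes, String value) {
        if (xfaBytes == null || value == null) {
            return false;
        }
        // Use an explicit charset so the result does not depend on the platform default.
        String xml = new String(xfaBytes, StandardCharsets.UTF_8);
        return xml.contains(value);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the bytes returned by getParsableXFAForm(file).
        byte[] sample = "<CIN>U72300DL1996PLC075672</CIN>".getBytes(StandardCharsets.UTF_8);
        System.out.println(xfaContains(sample, "U72300DL1996PLC075672"));
        System.out.println(xfaContains(sample, "some other value"));
    }
}
```

Passing the charset explicitly rules out the platform default encoding as a reason for a value not being found.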
This is the XFA for this file; if you search it for 'U72300DL1996PLC075672', the value is missing.
This is a normal file that gives all fields.
Any ideas? I have tried everything I can think of, and my reasoning is that since the readers can display that value, I should be able to extract it as well.
EDIT: You will have to download the files; you might not be able to view them in the browser.
There are multiple entries of XFA content within the form representing the different states the form had prior to applying the different signatures. As you are using
PDDocument.load(file)
the PDF is parsed sequentially and the most current XFA content is not picked up. If you change that to
PDDocument.loadNonSeq(file,null)
the Xref information is used and the most current XFA is extracted, which contains the information you are looking for.
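A rough way to check whether a file carries several saved states (as signed forms usually do) is to count the %%EOF markers in the raw bytes, since each incremental save appends its own xref section and end-of-file marker. The sketch below uses only the JDK and does not touch PDFBox; the class name PdfRevisions and the method countMarker are made up for the example. Treat the count as a heuristic only: a file may contain more than one marker for other reasons, for example linearization or marker-like bytes inside stream data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class PdfRevisions {
    /** Counts non-overlapping occurrences of an ASCII marker in the raw file bytes. */
    public static int countMarker(byte[] data, String marker) {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        if (data == null || m.length == 0) {
            return 0;
        }
        int count = 0;
        int i = 0;
        while (i + m.length <= data.length) {
            if (matchesAt(data, i, m)) {
                count++;
                i += m.length; // skip past this match
            } else {
                i++;
            }
        }
        return count;
    }

    /** Returns true if the marker bytes occur in data starting at position pos. */
    private static boolean matchesAt(byte[] data, int pos, byte[] m) {
        for (int j = 0; j < m.length; j++) {
            if (data[pos + j] != m[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfRevisions <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        // More than one marker hints at incremental updates, i.e. several saved states.
        System.out.println("%%EOF markers: " + countMarker(data, "%%EOF"));
    }
}
```

If the count is greater than one for error.pdf, that would be consistent with the explanation above: a parser that reads the file front to back can end up with an XFA stream from an earlier state, while one that follows the xref information resolves the latest one.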
Note that for PDFBox 1.8.x one should always use PDDocument.loadNonSeq in order to parse the PDF in line with the specification, i.e. by following the Xref information. PDDocument.load should only be used to handle files with (Xref-related) parsing errors, where sequential parsing can serve as a fallback.
For PDFBox 2.x, PDDocument.load parses by following the Xref, i.e. like PDDocument.loadNonSeq in 1.8, and sequential parsing is done behind the scenes in case there are errors.