To reach PDF/A-1a conformance, I am setting XMP metadata on a PDF using PDFBox v2.0.13. Before setting the metadata, I convert the file from .docx to PDF. I have tried two converters: XDocReport v2.0.1 and Docx4j v6.1.0.
In the Java class I have the following code:
PDDocumentInformation info = pdf.getDocumentInformation();
info.setTitle("Apache PDFBox");
info.setSubject("Apache PDFBox adding meta-data to PDF document");
info.setCreator("MyCreator");
...
DublinCoreSchema dcSchema = metadata.createAndAddDublinCoreSchema();
dcSchema.setTitle(info.getTitle());
dcSchema.setDescription(info.getSubject());
dcSchema.addCreator(info.getCreator());
When I convert with XDocReport, I get the following metadata:
</rdf:Description>
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">Apache PDFBox</rdf:li>
</rdf:Alt>
</dc:title>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li>MyCreator</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
When I convert with Docx4j, I instead get the following metadata:
<rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
<dc:title>
<rdf:Alt>
<rdf:li lang="x-default">Apache PDFBox</rdf:li>
</rdf:Alt>
</dc:title>
<dc:description>
<rdf:Alt>
<rdf:li lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
</rdf:Alt>
</dc:description>
<dc:creator>
<rdf:Seq>
<rdf:li>MyCreator</rdf:li>
</rdf:Seq>
</dc:creator>
</rdf:Description>
The only difference is the language attribute on the "title" and "description" entries: xml:lang="x-default" with XDocReport, but a bare lang="x-default" with Docx4j. Because of it, the final PDF produced using XDocReport passes PDF/A-1a validation, while the one produced using Docx4j does not.
The validation is done with veraPDF.
Since Docx4j produces a more readable PDF, is there a way to fix the metadata in the final pdf?
This is a known problem when xmpbox is used together with certain other libraries, e.g. FOP.
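The culprit is the transformer: a TransformerFactory implementation brought in by the other library gets picked up instead of the JDK default.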
This code in XmpSerializer.java:
Transformer transformer = TransformerFactory.newInstance().newTransformer();
should return an instance of the com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl class. (Try it by printing the class of the returned object.)
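A quick way to check is to print the class of whatever the default lookup returns. This is only a minimal diagnostic sketch; the class name TransformerCheck and its helper method are made up for illustration and are not part of xmpbox.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper, not part of PDFBox/xmpbox.
class TransformerCheck {

    /** Returns the class name of the Transformer produced by the default JAXP lookup. */
    static String transformerClassName() throws TransformerConfigurationException {
        // Same call as in XmpSerializer: the factory is resolved from the
        // system property, the classpath services files, or the JDK default.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        return transformer.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        // Run this with the same classpath as the application that writes the XMP.
        System.out.println(transformerClassName());
    }
}
```

If the printed name is anything other than the JDK's internal TransformerImpl, some jar on the classpath is supplying its own implementation.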
javadoc: https://docs.oracle.com/javase/7/docs/api/javax/xml/transform/TransformerFactory.html#newInstance()
"The Services API will look for a classname in the file META-INF/services/javax.xml.transform.TransformerFactory in jars available to the runtime."
You can force the default implementation by setting a system property:
System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");
However, this might break something in the other library, which may rely on its own transformer implementation.
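One way to limit that risk is to set the property only while the XMP is being serialized and to restore the previous value afterwards. The sketch below assumes the transformer is created inside the wrapped call (as the XmpSerializer line quoted above suggests); ScopedTransformerFactory and withJdkFactory are hypothetical names. System properties are global to the JVM, so other threads would still see the override while the action runs.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of PDFBox/xmpbox.
class ScopedTransformerFactory {
    static final String KEY = "javax.xml.transform.TransformerFactory";
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    /** Runs the action with the JDK factory forced, then restores the old setting. */
    static <T> T withJdkFactory(Callable<T> action) throws Exception {
        String previous = System.getProperty(KEY);
        System.setProperty(KEY, JDK_FACTORY);
        try {
            return action.call();
        } finally {
            // Put things back exactly as they were, including "not set".
            if (previous == null) {
                System.clearProperty(KEY);
            } else {
                System.setProperty(KEY, previous);
            }
        }
    }
}
```

The call that serializes the metadata would go inside the lambda passed to withJdkFactory, returning null if there is no result to hand back.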
A different solution is to copy the source code of XmpSerializer and change the newInstance call like this:
Transformer transformer = TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null).newTransformer();
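To see the explicit factory in isolation, the following sketch serializes a tiny DOM fragment carrying an xml:lang attribute. It is not the XmpSerializer code itself, only an illustration of the same newInstance call; ExplicitFactoryDemo and serializeSample are made-up names, and the element content is taken from the question.

```java
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of PDFBox/xmpbox.
class ExplicitFactoryDemo {
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Builds <rdf:li xml:lang="x-default"> in a DOM and serializes it to a string. */
    static String serializeSample() throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().newDocument();

        Element li = doc.createElementNS(RDF_NS, "rdf:li");
        // The attribute lives in the XML namespace, so it needs the xml: prefix.
        li.setAttributeNS(XMLConstants.XML_NS_URI, "xml:lang", "x-default");
        li.setTextContent("Apache PDFBox");
        doc.appendChild(li);

        // Name the factory explicitly instead of relying on classpath lookup.
        Transformer transformer =
                TransformerFactory.newInstance(JDK_FACTORY, null).newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeSample());
    }
}
```

Because the factory is named in the call, jars that register their own TransformerFactory do not affect this code path, and nothing else in the JVM is reconfigured.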