pdfbox apache-fop docx4j pdfa xdocreport

How to fix PDF/A metadata set by PDFBox (working with Docx4j and XDocReport)


To reach conformance level PDF/A-1a, I am setting XMP metadata on a PDF using PDFBox v2.0.13. Before setting the metadata, I convert the file from .docx to PDF. I have tried two converters: XDocReport v2.0.1 and Docx4j v6.1.0.

In the Java class I have the following code:

PDDocumentInformation info = pdf.getDocumentInformation();
info.setTitle("Apache PDFBox");
info.setSubject("Apache PDFBox adding meta-data to PDF document");
info.setCreator("MyCreator");
...
DublinCoreSchema dcSchema = metadata.createAndAddDublinCoreSchema();
dcSchema.setTitle(info.getTitle());
dcSchema.setDescription(info.getSubject());
dcSchema.addCreator(info.getCreator());

When I convert with XDocReport, I get the following metadata:

  </rdf:Description>
    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Apache PDFBox</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:description>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
        </rdf:Alt>
      </dc:description>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>MyCreator</rdf:li>
        </rdf:Seq>
      </dc:creator>
   </rdf:Description>

When I convert with Docx4j instead, I get the following metadata:

    <rdf:Description xmlns:dc="http://purl.org/dc/elements/1.1/" rdf:about="">
      <dc:title>
        <rdf:Alt>
          <rdf:li lang="x-default">Apache PDFBox</rdf:li>
        </rdf:Alt>
      </dc:title>
      <dc:description>
        <rdf:Alt>
          <rdf:li lang="x-default">Apache PDFBox adding meta-data to PDF document</rdf:li>
        </rdf:Alt>
      </dc:description>
      <dc:creator>
        <rdf:Seq>
          <rdf:li>MyCreator</rdf:li>
        </rdf:Seq>
      </dc:creator>
    </rdf:Description>

The only difference is in the rdf:li elements for "title" and "description": the Docx4j path writes lang="x-default" where the XDocReport path writes xml:lang="x-default". Because of this, the PDF produced via XDocReport passes PDF/A-1a validation, while the one produced via Docx4j fails.

The validation is done with veraPDF.

Since Docx4j produces the more readable PDF, is there a way to fix the metadata in the final PDF?


Solution

  • This is a known problem when xmpbox is used together with certain other libraries, e.g. FOP.

    The XML transformer is the problem: with those libraries on the classpath, a different TransformerFactory implementation can get picked up.

    This code in XmpSerializer.java:

    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    

    should return an instance of com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl. (Print the class name to check.)
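
One way to "try it" is to print the class name of the transformer that the JAXP lookup selects on your classpath. This is only a minimal sketch: TransformerCheck and transformerClassName are hypothetical names made up for illustration, not part of PDFBox or xmpbox. Run it with the same dependencies as your conversion code.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper class; not part of PDFBox or xmpbox.
class TransformerCheck {

    // Returns the class name of the Transformer that the JAXP lookup
    // selects on the current classpath.
    static String transformerClassName() {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            return transformer.getClass().getName();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Could not create a Transformer", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transformerClassName());
    }
}
```

If the printed name is not the com.sun.org.apache.xalan.internal class mentioned above, some jar on the classpath is supplying its own implementation.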

    javadoc: https://docs.oracle.com/javase/7/docs/api/javax/xml/transform/TransformerFactory.html#newInstance()

    "The Services API will look for a classname in the file META-INF/services/javax.xml.transform.TransformerFactory in jars available to the runtime."

    You can force the default implementation by setting a system property:

    System.setProperty("javax.xml.transform.TransformerFactory", "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl");
    

    However, this may break something in the other library, because the system property applies to the whole JVM.
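
If the JVM-wide effect is a concern, one possible mitigation is to set the property only around the code that needs it and restore the previous value afterwards. The sketch below assumes exactly that. ScopedTransformerFactory and withJdkFactory are hypothetical names, and the approach is not thread-safe: other threads see the property while the action runs.

```java
import java.util.function.Supplier;
import javax.xml.transform.TransformerFactory;

// Hypothetical helper; not part of PDFBox or xmpbox.
class ScopedTransformerFactory {

    static final String PROPERTY = "javax.xml.transform.TransformerFactory";
    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";

    // Runs the action with the property pointing at the JDK factory, then
    // restores whatever value (or absence of a value) was there before.
    static <T> T withJdkFactory(Supplier<T> action) {
        String previous = System.getProperty(PROPERTY);
        System.setProperty(PROPERTY, JDK_FACTORY);
        try {
            return action.get();
        } finally {
            if (previous == null) {
                System.clearProperty(PROPERTY);
            } else {
                System.setProperty(PROPERTY, previous);
            }
        }
    }

    public static void main(String[] args) {
        // Assumption: in real code the action would wrap whichever xmpbox
        // call creates the transformer, not this simple lookup.
        String chosen = withJdkFactory(
                () -> TransformerFactory.newInstance().getClass().getName());
        System.out.println(chosen);
    }
}
```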

    A different solution is to copy the source code of XmpSerializer and change the newInstance call like this:

    Transformer transformer = TransformerFactory.newInstance("com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl", null).newTransformer();
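
To see what the explicitly named factory does with an xml:lang attribute, here is a small JDK-only check. It is a sketch under assumptions: XmlLangCheck is a hypothetical class, and the DOM is built by hand here rather than by xmpbox, so it does not reproduce XmpSerializer itself. It only shows the pinned newInstance call in a runnable context.

```java
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical JDK-only check; it does not use xmpbox.
class XmlLangCheck {

    static final String JDK_FACTORY =
            "com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Builds a single rdf:li element with an xml:lang attribute and serializes
    // it with a Transformer from the explicitly named JDK factory.
    static String serializeWithJdkFactory() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();

            Element li = doc.createElementNS(RDF_NS, "rdf:li");
            li.setAttributeNS(XMLConstants.XML_NS_URI, "xml:lang", "x-default");
            li.setTextContent("Apache PDFBox");
            doc.appendChild(li);

            Transformer transformer =
                    TransformerFactory.newInstance(JDK_FACTORY, null).newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeWithJdkFactory());
    }
}
```

On Java 9 and later, TransformerFactory.newDefaultInstance() also returns the JDK's built-in implementation without naming the internal class, which may be a cleaner choice if you patch a copy of XmpSerializer anyway.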
    
