Docx to Pdf conversion using docx4j produces an artifact in a numbered list


I am trying to convert a docx document to PDF without changing its content. I am using the 'export-FO' approach, because the 'Microsoft Graph' and 'documents4j' approaches do not meet my requirements. My document contains a numbered list, and the resulting PDF shows an artifact: the first number in the list is always overlaid with the number that would follow the last item (last+1) of the same list.

What causes this kind of behavior? What can I do to fix it?

Here is the link to the representative image of this artifact

This is the sample code I use to convert documents:

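import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.docx4j.Docx4J;
import org.docx4j.fonts.BestMatchingMapper;
import org.docx4j.fonts.Mapper;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class Main {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes both streams even if the conversion throws
        try (InputStream templateInputStream = new FileInputStream("document.docx");
             OutputStream os = new FileOutputStream("document.pdf")) {
            WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(templateInputStream);

            Mapper fontMapper = new BestMatchingMapper();
            wordMLPackage.setFontMapper(fontMapper);

            Docx4J.toPDF(wordMLPackage, os);
        }
    }
}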

These are the dependencies in the sample project:

<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-core</artifactId>
    <version>11.5.2</version>
</dependency>

<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-export-fo</artifactId>
    <version>11.5.2</version>
</dependency>

<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
    <version>11.5.2</version>
</dependency>

<dependency>
    <groupId>org.apache.xmlgraphics</groupId>
    <artifactId>fop</artifactId>
    <version>2.10</version>
</dependency>

And a source docx document: Google Drive link here


Solution

  • This seems to be caused by the PP_COMMON_CONTAINERIZATION feature.

    It groups the list items in a content control, and then seems to incorrectly number the content control itself as well.

    You need to turn that feature off, but Docx4J.toPDF doesn't give you that option.

    You can use this instead:

            FOSettings foSettings = Docx4J.createFOSettings();
            foSettings.setOpcPackage(wordMLPackage);
            // Turn off the step that groups the list items in a content control
            foSettings.getFeatures().remove(ConversionFeatures.PP_COMMON_CONTAINERIZATION);

            Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
    

    Or

            FOSettings foSettings = Docx4J.createFOSettings();
            foSettings.setOpcPackage(wordMLPackage);
            Docx4J.toFO(foSettings, os, Docx4J.FLAG_EXPORT_PREFER_NONXSL); // NONXSL ignores content controls
    

    This is now being tracked at https://github.com/plutext/docx4j/issues/607
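
If you want to check that the extra "last+1" number really comes from the conversion and not from the source file, one rough way is to count the numbered paragraphs in the docx itself and compare that with the numbers you see in the PDF. The sketch below is not part of docx4j; it uses only the JDK (Java 9+) and relies on a few assumptions: the main document part is stored at word/document.xml (the usual location), the XML uses the conventional w: prefix, and numbering is applied directly on paragraphs via w:numPr (numbering inherited from a paragraph style would not be counted). The class name NumberedParagraphCounter and the method countNumberedParagraphs are made up for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class NumberedParagraphCounter {

    // Matches an opening <w:numPr> element (assumes the conventional "w:" prefix).
    private static final Pattern NUM_PR = Pattern.compile("<w:numPr[\\s/>]");

    // Counts opening w:numPr elements in word/document.xml.
    // Returns -1 if the archive has no word/document.xml entry.
    static int countNumberedParagraphs(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                return -1;
            }
            String xml;
            try (InputStream in = zip.getInputStream(entry)) {
                xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
            int count = 0;
            Matcher m = NUM_PR.matcher(xml);
            while (m.find()) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        Path docx = Paths.get(args.length > 0 ? args[0] : "document.docx");
        if (!Files.exists(docx)) {
            System.out.println("File not found: " + docx);
            return;
        }
        System.out.println("Numbered paragraphs in source: " + countNumberedParagraphs(docx));
    }
}
```

If the count matches the number of list items in Word but the PDF shows one number more, that points at the export step rather than the document.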