How can I dump XML body of XWPFDocument?


This seems like it should be easy, but I can't find the answer anywhere.

Using Java 8, and Apache POI and Apache POI-OOXML 4.1.2, we are converting documents from an XML-based derivative of EPUB3 into the DOCX format. I'm new to the project, and am trying to debug something. As part of my debugging toolkit, I'd like to dump the XML in the equivalent of the document.xml file within a .docx file to a string that I can print out or save.

I tried XWPFWordExtractor, but that seems to print out text and not XML. I also tried .toString(), which appears to print out the address of the object, and iterating through the results of getBodyElementsIterator(), which isn't quite it either.

This question helped me print bytes, but not the XML I wanted: "Can XWPFDocument be converted to a Byte[] without saving it to a file first?"

I just want something like

public void dumpDocx(final XWPFDocument docx) {
    System.out.println(docx.getBody().toXml().toString());
}

And I'd like the output to be the XML representing the contents of document.xml.


Solution

  • A *.docx file is simply a ZIP archive containing several XML files and some other files. So the result of XWPFDocument.write, whether a file or a byte array, can be handled as such: unzip it and look at /word/document.xml, for example.

    But if you want to avoid writing out the whole document, you need to know that XWPFDocument is built internally on org.openxmlformats.schemas.wordprocessingml.x2006.main.CT* objects, which all extend org.apache.xmlbeans.XmlObject, and that XmlObject.toString() returns the XML as a String. For the document XML, XWPFDocument.getDocument returns an org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocument1, which is the representation of /word/document.xml.

    So System.out.println(docx.getDocument().toString()); will print the XML of the underlying CTDocument1.

    Unfortunately, org.apache.xmlbeans.XmlObject only represents the contents of an element or attribute, not the element or attribute itself. So when you validate or save an XmlObject, you are validating or saving its contents, not its container. For CTDocument1 that means the output contains the body elements but not the document container itself. To get the document container itself as an XmlObject, you need an org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument object that contains the CTDocument1.

    Example that prints the document XML from an XWPFDocument:

    import java.io.FileOutputStream;
    
    import org.apache.poi.xwpf.usermodel.*;
    
    public class CreateXWPFDocumentDumpDocumentXML {
        
     static void printDocumentXML(XWPFDocument docx) throws Exception {
         
      String xml;
      
      // CTDocument1 on its own: toString() prints only its contents,
      // without the enclosing document element.
      System.out.println("Contents of org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocument1:");
      org.apache.xmlbeans.XmlObject documentXmlObject = docx.getDocument();
      xml = documentXmlObject.toString();
      System.out.println(xml);
      
      // Wrap the CTDocument1 in a DocumentDocument so that the output
      // also includes the document container.
      System.out.println("Contents of whole DocumentDocument:");
      org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDocument1 ctDocument1 = docx.getDocument();
      org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument documentDocument = org.openxmlformats.schemas.wordprocessingml.x2006.main.DocumentDocument.Factory.newInstance();
      documentDocument.setDocument(ctDocument1);
      xml = documentDocument.toString();
      System.out.println(xml);
      
     }
    
     public static void main(String[] args) throws Exception {
    
      XWPFDocument docx = new XWPFDocument();
      XWPFParagraph paragraph = docx.createParagraph();
      XWPFRun run=paragraph.createRun(); 
      run.setBold(true);
      run.setFontSize(22);
      run.setText("The paragraph content ...");
      paragraph = docx.createParagraph();
    
      printDocumentXML(docx);
    
      try (FileOutputStream out = new FileOutputStream("./XWPFDocument.docx")) {
        docx.write(out);
      } 
    
     }
    }
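
The ZIP route mentioned at the start of the answer can be sketched with the JDK alone. The class name ZipEntryDump and the helpers readZipEntry and zipOf are hypothetical names invented for this illustration, not part of Apache POI. zipOf only builds a tiny in-memory archive as a stand-in for the bytes that XWPFDocument.write would produce, so the sketch runs without POI on the classpath. It assumes the entry is UTF-8 encoded and sticks to Java 8 APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipEntryDump {

    // Returns the named entry decoded as UTF-8 (an assumption about the
    // entry's encoding), or null if the archive has no such entry.
    static String readZipEntry(byte[] zipBytes, String entryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    ByteArrayOutputStream content = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int read;
                    // read() returns -1 at the end of the current entry
                    while ((read = zip.read(buffer)) != -1) {
                        content.write(buffer, 0, read);
                    }
                    return new String(content.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    // Stand-in for the bytes of a real .docx: a one-entry archive built in memory.
    static byte[] zipOf(String entryName, String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(text.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = zipOf("word/document.xml", "<w:document/>");
        System.out.println(readZipEntry(archive, "word/document.xml"));
    }
}
```

With POI available, the archive bytes would instead come from writing the document to a ByteArrayOutputStream via docx.write(out) and passing out.toByteArray() to readZipEntry with the entry name "word/document.xml" (no leading slash inside the archive). Unlike the XmlObject.toString() approach, this serializes the whole document first, so it shows exactly what would end up in the saved file.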