java xpath apache-poi xmlbeans xwpf

Apache POI XWPF - Check if a run contains a picture


My goal is to process a .docx document in Java using Apache POI. I want to extract everything from the document and create a new one that contains only specific content, which I select from the processed document. That works so far for tables and text, but I have a problem with pictures. Normally I would extract them like this:

List<XWPFPictureData> images = r.getEmbeddedPictures();

Here r is of type XWPFRun and is taken from a paragraph. The big problem is that this only works for some images; it depends on how the image was inserted into the Word document.

I can access the XML of a run, so I tried to find images with an XPath query. That worked fine in Python, but when I tried the same in Java I got an error message.

Here is my code to check if a run contains an image:

r.getCTR().selectPath(".//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed");

And it throws an exception.


Solution

  • All the available XPath engines are namespace-aware, so every namespace prefix used in the path must be declared as part of the query.

    import java.io.FileInputStream;

    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;
    import org.apache.xmlbeans.XmlObject;

    public class WordRunSelectPath {

        // Namespace declarations prepended to the XPath so that every prefix resolves.
        private static final String DECLARE_NAMESPACES =
              "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'; "
            + "declare namespace wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing'; "
            + "declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main'; "
            + "declare namespace pic='http://schemas.openxmlformats.org/drawingml/2006/picture'; "
            + "declare namespace r='http://schemas.openxmlformats.org/officeDocument/2006/relationships' ";

        public static void main(String[] args) throws Exception {
            // try-with-resources closes the stream and the document, even on failure.
            try (FileInputStream in = new FileInputStream("WordInsertPictures.docx");
                 XWPFDocument document = new XWPFDocument(in)) {
                for (XWPFParagraph paragraph : document.getParagraphs()) {
                    for (XWPFRun run : paragraph.getRuns()) {
                        XmlObject[] selectedObjects = run.getCTR().selectPath(
                              DECLARE_NAMESPACES
                            + ".//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed");
                        if (selectedObjects.length > 0) {
                            // The selected attribute holds the relationship ID of the embedded picture.
                            String rID = selectedObjects[0].newCursor().getTextValue();
                            System.out.println(rID);
                        }
                    }
                }
            }
        }
    }
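
The namespace point can also be tried without a .docx file. The following is only a minimal sketch under stated assumptions: it uses the JDK's own javax.xml.xpath API instead of XMLBeans, and the run XML, the relationship IDs rId4 and rId5, and the names NamespaceAwareXPathSketch, select, INLINE_ONLY and ANY_BLIP are made up for illustration. It binds the same prefixes through a NamespaceContext. Because the path in the question goes through wp:inline, it would presumably miss floating pictures, which Word stores under wp:anchor, so the sketch also compares it with a shorter path that looks for any a:blip.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceAwareXPathSketch {

    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String WP = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing";
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String PIC = "http://schemas.openxmlformats.org/drawingml/2006/picture";
    static final String R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    // Prefix bindings: the JDK counterpart of the "declare namespace" clauses above.
    static final Map<String, String> PREFIXES = Map.of("w", W, "wp", WP, "a", A, "pic", PIC, "r", R);

    // The path from the question (inline pictures only) and a broader alternative.
    static final String INLINE_ONLY =
        ".//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed";
    static final String ANY_BLIP = ".//a:blip/@r:embed";

    // Hypothetical, simplified run holding one inline and one anchored (floating) picture.
    static final String SAMPLE_RUN =
        "<w:r xmlns:w='" + W + "' xmlns:wp='" + WP + "' xmlns:a='" + A
        + "' xmlns:pic='" + PIC + "' xmlns:r='" + R + "'>"
        + drawing("inline", "rId4") + drawing("anchor", "rId5") + "</w:r>";

    // Builds a w:drawing element whose picture sits inside wp:inline or wp:anchor.
    static String drawing(String wrapper, String relId) {
        return "<w:drawing><wp:" + wrapper + "><a:graphic><a:graphicData><pic:pic><pic:blipFill>"
            + "<a:blip r:embed='" + relId + "'/>"
            + "</pic:blipFill></pic:pic></a:graphicData></a:graphic></wp:" + wrapper + "></w:drawing>";
    }

    // Evaluates the expression against SAMPLE_RUN and returns the selected attribute values.
    static List<String> select(String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // the parser must keep namespace information
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(SAMPLE_RUN)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return PREFIXES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
                }
                // Reverse lookups are not needed for this sketch.
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) {
                    return Collections.emptyIterator();
                }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getNodeValue());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("inline only: " + select(INLINE_ONLY));
        System.out.println("any blip:    " + select(ANY_BLIP));
    }
}
```

With this sample, the path from the question selects only rId4, while the shorter path selects both rId4 and rId5. Whether the shorter path is appropriate for a real document depends on what else in the run may contain an a:blip, so treat it as a starting point rather than a drop-in replacement.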