My goal is to process a .docx document in Java using Apache POI. I want to extract everything from the document and create a new one that contains only specific content, which I choose from the processed document. This works so far for tables and text, but I have a problem with pictures. Normally I would extract them like this:
List<XWPFPictureData> images = r.getEmbeddedPictures();
where r is an XWPFRun taken from a paragraph.
The big problem is that this only works for some images; it depends on how the image was inserted into the Word document.
I can access the XML of a run, so I tried to find images there directly. That worked fine in Python, where you can state an XPath query. I tried the same in Java but got an error message.
Here is my code to check if a run contains an image:
r.getCTR().selectPath(".//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed")
All of the available XPath engines are namespace-aware, so every namespace prefix used in the path must be declared as part of the query:
import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlObject;

public class WordRunSelectPath {

    public static void main(String[] args) throws Exception {
        // Every prefix used in the path must be bound to its namespace URI.
        String declareNameSpaces = "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'; "
            + "declare namespace wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing'; "
            + "declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main'; "
            + "declare namespace pic='http://schemas.openxmlformats.org/drawingml/2006/picture'; "
            + "declare namespace r='http://schemas.openxmlformats.org/officeDocument/2006/relationships' ";

        // try-with-resources closes the stream and the document even if an exception is thrown.
        try (FileInputStream in = new FileInputStream("WordInsertPictures.docx");
             XWPFDocument document = new XWPFDocument(in)) {
            for (XWPFParagraph paragraph : document.getParagraphs()) {
                for (XWPFRun run : paragraph.getRuns()) {
                    // Select the relationship ID of an inline picture in this run, if there is one.
                    XmlObject[] selectedObjects = run.getCTR().selectPath(
                        declareNameSpaces
                        + ".//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed");
                    if (selectedObjects.length > 0) {
                        String rID = selectedObjects[0].newCursor().getTextValue();
                        System.out.println(rID);
                    }
                }
            }
        }
    }
}
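
Note that the path above only matches pictures placed under wp:inline. Pictures with text wrapping are usually stored under wp:anchor instead, which may be why some images are missed. The following is a minimal sketch of the namespace binding and of a shorter path that ignores the wrapper element. It uses only the JDK's javax.xml.xpath API rather than POI, so that it runs without any document on disk. The class BlipIdFinder, its members and the sample run XML are made up for illustration; they are not part of Apache POI or XmlBeans.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; JDK classes, no POI involved.
final class BlipIdFinder {

    // Prefix-to-URI bindings: the JDK counterpart of the "declare namespace" clauses.
    static final Map<String, String> NAMESPACES = Map.of(
        "w", "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
        "wp", "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing",
        "a", "http://schemas.openxmlformats.org/drawingml/2006/main",
        "pic", "http://schemas.openxmlformats.org/drawingml/2006/picture",
        "r", "http://schemas.openxmlformats.org/officeDocument/2006/relationships");

    // Path from the answer: matches inline pictures only.
    static final String INLINE_ONLY =
        "//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic/pic:blipFill/a:blip/@r:embed";

    // Shorter path: matches a:blip at any depth, whatever the wrapper element is.
    static final String ANY_BLIP = "//a:blip/@r:embed";

    private static final String PIC_OPEN = "<a:graphic><a:graphicData><pic:pic><pic:blipFill>";
    private static final String PIC_CLOSE = "</pic:blipFill></pic:pic></a:graphicData></a:graphic>";

    // Made-up run XML: one inline picture (rId4) and one anchored picture (rId7).
    static final String SAMPLE_RUN =
        "<w:r xmlns:w='" + NAMESPACES.get("w") + "' xmlns:wp='" + NAMESPACES.get("wp") + "'"
        + " xmlns:a='" + NAMESPACES.get("a") + "' xmlns:pic='" + NAMESPACES.get("pic") + "'"
        + " xmlns:r='" + NAMESPACES.get("r") + "'>"
        + "<w:drawing><wp:inline>" + PIC_OPEN + "<a:blip r:embed='rId4'/>" + PIC_CLOSE
        + "</wp:inline></w:drawing>"
        + "<w:drawing><wp:anchor>" + PIC_OPEN + "<a:blip r:embed='rId7'/>" + PIC_CLOSE
        + "</wp:anchor></w:drawing>"
        + "</w:r>";

    // Evaluates the expression against the XML and returns the matched attribute values.
    static List<String> select(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required, or prefixed steps cannot match
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(String prefix) {
                    return NAMESPACES.getOrDefault(prefix, XMLConstants.NULL_NS_URI);
                }
                // Reverse lookups are not needed for evaluating these expressions.
                @Override public String getPrefix(String uri) { return null; }
                @Override public Iterator<String> getPrefixes(String uri) { return null; }
            });

            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getNodeValue());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select(SAMPLE_RUN, INLINE_ONLY));
        System.out.println(select(SAMPLE_RUN, ANY_BLIP));
    }
}
```

On the sample, the first expression finds only the inline picture's ID, while the second also finds the anchored one. With POI, the equivalent would presumably be to pass declareNameSpaces + ".//a:blip/@r:embed" to selectPath. I have not verified that against every way Word can store a picture, so treat it as a starting point.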