java maven apache-poi docx

Problem extracting text with XWPFWordExtractor


I have a Java application with this dependency:

 <dependency>
     <groupId>org.apache.poi</groupId>
     <artifactId>poi-ooxml</artifactId>
     <version>5.2.3</version>
 </dependency>

and this piece of code:

 XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(inputStream));
 return extractor.getText();

I'm trying to extract all the text from a Word document (.docx). The body text is extracted, but text inside other elements is ignored: for example, if the document contains a text box, the text box and the text inside it are skipped.

How can I extract the entire text, including text boxes and any other elements that can contain text?


Solution

  • I solved my problem by using Apache Tika:

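    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    // A write limit of -1 disables BodyContentHandler's default character cap
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1);
    Metadata metadata = new Metadata();
    parser.parse(inputStream, handler, metadata);
    return handler.toString();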
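
If adding Tika is not an option, the text box content can also be reached without any extra library, because a .docx file is a ZIP archive whose main part is word/document.xml. The following is only a rough sketch, not what Tika or POI do internally, and it rests on several assumptions: text box paragraphs sit under w:txbxContent inside word/document.xml, every mc:Fallback branch repeats the text of its mc:Choice sibling and can be skipped, and Java 9 or later is available for readAllBytes. The class name DocxRawText and its methods are made up for illustration. Headers, footers, footnotes and comments live in other parts of the archive and are not read here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects every w:t run of word/document.xml,
// including runs nested in text boxes (w:txbxContent).
final class DocxRawText {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";

    static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if (entry.getName().equals("word/document.xml")) {
                    // readAllBytes stops at the end of the current ZIP entry
                    return parse(zip.readAllBytes());
                }
            }
        }
        throw new IOException("word/document.xml not found");
    }

    static String parse(byte[] xml) throws Exception {
        StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Reject DOCTYPE declarations, since the input may be untrusted
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.newSAXParser().parse(new ByteArrayInputStream(xml), new DefaultHandler() {
            int fallbackDepth = 0;   // > 0 while inside an mc:Fallback branch
            boolean inText = false;  // true while inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (MC.equals(uri) && local.equals("Fallback")) {
                    fallbackDepth++;
                } else if (W.equals(uri) && local.equals("t")) {
                    inText = true;
                } else if (W.equals(uri) && local.equals("txbxContent") && fallbackDepth == 0) {
                    out.append('\n'); // separate the text box from the anchoring paragraph
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (MC.equals(uri) && local.equals("Fallback")) {
                    fallbackDepth--;
                } else if (W.equals(uri) && local.equals("t")) {
                    inText = false;
                } else if (W.equals(uri) && local.equals("p") && fallbackDepth == 0) {
                    out.append('\n'); // one line per paragraph
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText && fallbackDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }
}
```

Calling DocxRawText.extract(inputStream) returns the text in document order, so the content of a text box appears in the middle of the paragraph it is anchored to. Tabs, line breaks inside a paragraph and field codes are not handled, which is one reason a full parser such as Tika is usually the more robust choice.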