I have a Java application with this dependency:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.3</version>
</dependency>
and this piece of code:
XWPFWordExtractor extractor = new XWPFWordExtractor(new XWPFDocument(inputStream));
return extractor.getText();
I'm trying to extract all the text from a Word document (.docx). The body text is extracted, but if the document contains a text box, for example, the text box is ignored along with the text inside it.
How can I extract the entire text, including text boxes and any other elements that can contain text?
I solved my problem by using Apache Tika:
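import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
AutoDetectParser parser = new AutoDetectParser();
// A write limit of -1 disables the default 100,000-character cap on extracted text.
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
parser.parse(inputStream, handler, metadata);
return handler.toString();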
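For anyone who cannot add Tika, here is a rough sketch of a fallback that uses only the JDK. It is not how POI or Tika work internally. It relies on the fact that a .docx is a ZIP archive whose main part is word/document.xml, and that text box content is stored in that same part as ordinary w:t runs. The class name DocxTextSketch and the method extractAllText are made up for illustration, and the code assumes the usual w: and mc: namespace prefixes that Word writes. Word often stores a text box twice (a modern version under mc:Choice and a legacy copy under mc:Fallback), so the sketch skips mc:Fallback to avoid duplicated text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DocxTextSketch {

    /** Collects the text of every w:t run in word/document.xml, text boxes included. */
    public static String extractAllText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    // readAllBytes() stops at the end of the current ZIP entry.
                    return parseBody(new ByteArrayInputStream(zip.readAllBytes()));
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    private static String parseBody(InputStream xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int fallbackDepth = 0;   // > 0 while inside an mc:Fallback copy
            private boolean inText = false;  // true while inside a w:t element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // The default parser is not namespace-aware, so qName keeps the prefix.
                if (qName.equals("mc:Fallback")) {
                    fallbackDepth++;
                } else if (qName.equals("w:t") && fallbackDepth == 0) {
                    inText = true;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("mc:Fallback")) {
                    fallbackDepth--;
                } else if (qName.equals("w:t")) {
                    inText = false;
                } else if (qName.equals("w:p") && fallbackDepth == 0) {
                    out.append('\n');  // end of a paragraph (body or text box)
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return out.toString();
    }
}
```

Calling DocxTextSketch.extractAllText(inputStream) returns the body and text box text with a line break after each paragraph. Headers, footers, footnotes and comments live in other parts of the archive (for example word/header1.xml), so this sketch does not cover them, which is one reason Tika is the more complete option.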