I am trying to read the contents of a .docx file into a String using Apache POI. I am able to read the contents, but when the document has more than 7 or 8 pages, the contents from the 8th page are displayed before the first 7 pages. We are using the following code:
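import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

File doc = new File("D:\\1.docx");
InputStream repoDocument = new FileInputStream(doc);
XWPFDocument document = new XWPFDocument(repoDocument);
XWPFWordExtractor extractor = new XWPFWordExtractor(document);
String content = extractor.getText();
content = content.replace(" ", ""); // strips every space from the extracted text
System.out.println(content);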
Can anyone help us fix this?
Since this question is tagged docx4j, I take it you are also interested in how you'd solve this problem that way.
In docx4j, text extraction is done with org.docx4j.TextUtils.
Here's a demo:
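import java.io.OutputStreamWriter;
import java.io.Writer;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;

public static void main(String[] args) throws Exception {
    String inputfilepath = "YOUR_PATH/YOUR.docx";
    // Load the package and get the JAXB object for the main document part.
    WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
    MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
    org.docx4j.wml.Document wmlDocumentEl = (org.docx4j.wml.Document) documentPart.getJaxbElement();
    // Write the document text to standard output.
    Writer out = new OutputStreamWriter(System.out);
    org.docx4j.TextUtils.extractText(wmlDocumentEl, out);
    out.flush(); // flush rather than close, so System.out stays usable
}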
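If you want to cross-check what either library returns, a rough JDK-only sketch like the one below may help. It is not part of POI or docx4j: the class name DocxPlainText and its extractText method are made up for illustration. It assumes the file uses the usual (transitional) WordprocessingML namespace, reads only word/document.xml, and collects the w:t runs in the order they appear in the XML, ending each w:p paragraph with a newline. Headers, footers, footnotes, tabs and line breaks are ignored.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not a POI or docx4j class.
public class DocxPlainText {

    // Assumption: the document uses the transitional WordprocessingML namespace.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of word/document.xml in XML order, one line per paragraph. */
    public static String extractText(InputStream docx) throws IOException, XMLStreamException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return readBody(zip);
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    private static String readBody(InputStream xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);

        StringBuilder text = new StringBuilder();
        boolean inText = false; // true while the reader is inside a w:t element
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if (W_NS.equals(reader.getNamespaceURI()) && "t".equals(reader.getLocalName())) {
                    inText = true;
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (W_NS.equals(reader.getNamespaceURI())) {
                    if ("t".equals(reader.getLocalName())) {
                        inText = false;
                    } else if ("p".equals(reader.getLocalName())) {
                        text.append('\n'); // end of a paragraph
                    }
                }
            } else if (inText && (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA
                    || event == XMLStreamConstants.SPACE)) {
                text.append(reader.getText());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxPlainText <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extractText(in));
        }
    }
}
```

Run it on the same file and compare the output with what your extractor prints. If the order here matches the order you see in Word, the reordering is happening in the extractor rather than in the file itself.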