java, apache-poi, docx, docx4j

Unable to read more than 7 pages of docx to String using Apache POI


I am trying to read the content of a docx file into a String using Apache POI. I'm able to read the contents, but when the docx has more than 7 or 8 pages, the content from the 8th page onwards is printed before the first 7 pages. We are using the following code:

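import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Inside a method that declares "throws IOException"
try (InputStream repoDocument = new FileInputStream("D:\\1.docx");
     XWPFDocument document = new XWPFDocument(repoDocument)) {
    XWPFWordExtractor extractor = new XWPFWordExtractor(document);
    String content = extractor.getText();
    content = content.replace(" ", ""); // note: this strips every space character
    System.out.println(content);
}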

Can anyone help us fix this?


Solution

  • Since this question is tagged docx4j, I take it you are also interested in how you'd solve this problem that way.

    This can be done with org.docx4j.TextUtils.

    Here's a demo:

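    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
    import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;

    public static void main(String[] args) throws Exception {

        String inputfilepath = "YOUR_PATH/YOUR.docx";

        // Load the package and get the main document part (word/document.xml)
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new java.io.File(inputfilepath));
        MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();

        org.docx4j.wml.Document wmlDocumentEl = (org.docx4j.wml.Document) documentPart.getJaxbElement();

        // Use a java.io.StringWriter here instead to collect the text into a String
        Writer out = new OutputStreamWriter(System.out);

        // Write the document's text to the writer
        org.docx4j.TextUtils.extractText(wmlDocumentEl, out);

        // Flush rather than close, so System.out stays usable afterwards
        out.flush();

    }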
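
If you want to check what order the text actually has in the file, independent of either library, a rough sketch using only the JDK may help. This is not how POI or docx4j work internally; it relies only on the fact that a docx is a ZIP archive whose body lives in word/document.xml, with text in w:t elements inside w:p paragraphs. The class name DocxPlainText and the method extractText are made up for this example. It assumes Java 9+ (for readAllBytes) and the usual transitional WordprocessingML namespace, and it does not read headers, footers, or footnotes, nor treat tabs, line breaks, or text boxes specially.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper for illustration; not part of Apache POI or docx4j.
public class DocxPlainText {

    // Namespace of transitional WordprocessingML (assumed; strict OOXML files differ).
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the body text in document order, one line per paragraph. */
    public static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry first so the XML parser cannot close the ZIP stream.
                byte[] xmlBytes = zip.readAllBytes();

                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations to avoid XXE on untrusted files.
                factory.setFeature(
                        "http://apache.org/xml/features/disallow-doctype-decl", true);
                Document xml = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(xmlBytes));

                StringBuilder out = new StringBuilder();
                walk(xml.getDocumentElement(), out);
                return out.toString();
            }
        }
        throw new IOException("word/document.xml not found; is this a docx file?");
    }

    // Depth-first walk: append each w:t text, and a newline after each w:p.
    private static void walk(Node node, StringBuilder out) {
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            boolean inWordNs = W_NS.equals(child.getNamespaceURI());
            if (inWordNs && "t".equals(child.getLocalName())) {
                out.append(child.getTextContent());
            } else {
                walk(child, out);
                if (inWordNs && "p".equals(child.getLocalName())) {
                    out.append('\n');
                }
            }
        }
    }
}
```

Calling DocxPlainText.extractText(new FileInputStream("D:\\1.docx")) would give a String to compare against the POI output. If the order matches here but not in your output, the reordering happens in the extractor rather than in the file.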