
How to extract plain text from a DOCX file using the new OOXML support in Apache POI 3.5?


On September 28, 2009, the Apache POI project released version 3.5, which officially supports the OOXML formats introduced in Office 2007, such as DOCX and XLSX.

Please provide a code sample that extracts a DOCX file's content as plain text, ignoring any styles or formatting.

I am asking this because I have been unable to find any Apache POI examples covering the new OOXML support.


Solution

  • This worked for me. Make sure the required jars are on the classpath (you may need to upgrade xmlbeans, etc.).

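    import java.io.InputStream;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    // Returns the document's text content with styles and formatting dropped.
    public String extractText(InputStream in) throws Exception {
        XWPFDocument doc = new XWPFDocument(in);
        XWPFWordExtractor ex = new XWPFWordExtractor(doc);
        return ex.getText();
    }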
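
If getting the jars onto the classpath is the sticking point, it may help to see roughly what a plain-text extraction involves. A DOCX file is a ZIP archive whose main body lives in the entry word/document.xml, with the visible text stored in w:t elements inside w:p paragraphs. The sketch below is not POI and is not how POI is implemented; it is a JDK-only illustration under those assumptions, and the class name DocxPlainText is hypothetical. It ignores tabs, line breaks, headers, footers, and footnotes, which the POI extractor handles far more completely.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: a JDK-only sketch, not part of Apache POI.
public class DocxPlainText {
    // WordprocessingML main namespace used by w:p and w:t.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static String extractText(InputStream in) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Copy the entry into memory so the XML parser cannot close the ZIP stream.
                ByteArrayOutputStream xml = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    xml.write(chunk, 0, n);
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations, since the file may be untrusted.
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                Document dom = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.toByteArray()));

                // Walk all WordprocessingML elements in document order:
                // start a new line at each paragraph, append the text of each w:t.
                StringBuilder text = new StringBuilder();
                NodeList all = dom.getElementsByTagNameNS(W_NS, "*");
                for (int i = 0; i < all.getLength(); i++) {
                    String name = all.item(i).getLocalName();
                    if ("p".equals(name) && text.length() > 0) {
                        text.append('\n');
                    } else if ("t".equals(name)) {
                        text.append(all.item(i).getTextContent());
                    }
                }
                return text.toString();
            }
        }
        throw new IllegalArgumentException("No word/document.xml entry found");
    }
}
```

It is called the same way as the POI version, for example DocxPlainText.extractText(new FileInputStream("report.docx")), where report.docx is a placeholder file name. For anything beyond a quick check, the POI extractor above is the safer choice.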