bidi string can't be read from Word (Apache POI)


I'm writing a bidi String to an MS Word file using Apache POI after wrapping it with the sequence aString = "\u202E" + aString + "\u202C"; The text renders correctly in the file and reads fine when I retrieve the string again. But if I modify the file in any way, reading that string suddenly returns true for isBlank(). Thank you in advance for any suggestions or help!
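
For reference, the wrapping described above can be sketched in plain Java as follows. The class and method names (BidiWrap, wrap) are made up for illustration, and String.isBlank() assumes Java 11 or later. The point of the sketch is that the two control characters are not whitespace, so a true result from isBlank() suggests the text was not read at all, rather than that only the wrapper came back.

```java
public class BidiWrap {

    static final char RLO = '\u202E'; // RIGHT-TO-LEFT OVERRIDE
    static final char PDF = '\u202C'; // POP DIRECTIONAL FORMATTING

    // Hypothetical helper: wraps the text in an override sequence, as in the question.
    static String wrap(String s) {
        return RLO + s + PDF;
    }

    public static void main(String[] args) {
        String wrapped = wrap("abc");
        // Two control characters are added around the three letters.
        System.out.println(wrapped.length());              // 5
        // Neither control character counts as whitespace in Java.
        System.out.println(Character.isWhitespace(RLO));   // false
        System.out.println(wrapped.isBlank());             // false
        // Even the bare wrapper around an empty string is not blank.
        System.out.println(wrap("").isBlank());            // false
    }
}
```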


Solution

  • When Microsoft Word stores bidirectional text in its Office Open XML *.docx format, it sometimes uses special text run elements w:bdo (bidirectional override). Apache POI does not read those elements so far. So if an XWPFParagraph contains such elements, paragraph.getText() will omit their text, and it returns an empty string when all of the paragraph's text sits inside them.

    One can use org.apache.xmlbeans.XmlCursor to really get all text from all XWPFParagraphs, like so:

    import java.io.FileInputStream;

    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.xmlbeans.XmlCursor;

    public class ReadWordParagraphs {

        // Returns the text of every node below the paragraph's underlying XML element,
        // including runs nested inside w:bdo.
        static String getAllTextFromParagraph(XWPFParagraph paragraph) {
            XmlCursor cursor = paragraph.getCTP().newCursor();
            return cursor.getTextValue();
        }

        public static void main(String[] args) throws Exception {
            // try-with-resources closes the stream and the document when done
            try (FileInputStream in = new FileInputStream("WordDocument.docx");
                 XWPFDocument document = new XWPFDocument(in)) {
                for (XWPFParagraph paragraph : document.getParagraphs()) {
                    System.out.println(paragraph.getText()); // will not return text in w:bdo elements
                    System.out.println(getAllTextFromParagraph(paragraph)); // will return all text content of paragraph
                }
            }
        }
    }
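
To see why the two calls above can disagree, here is a small sketch that uses only the JDK's DOM parser and needs neither Apache POI nor a real *.docx file. The paragraph XML is hand-written for illustration, and textOfDirectRuns is a simplified stand-in for "only look at runs directly under the paragraph", not Apache POI's actual code. All class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class BdoTextDemo {

    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written paragraph whose only run is wrapped in w:bdo (illustrative input).
    static final String PARAGRAPH_XML =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:bdo w:val=\"rtl\"><w:r><w:t>shalom</w:t></w:r></w:bdo>"
        + "</w:p>";

    static Element parseParagraph(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    // Simplified stand-in: collects text only from w:r elements that are
    // direct children of the paragraph, so runs nested in w:bdo are skipped.
    static String textOfDirectRuns(Element paragraph) {
        StringBuilder sb = new StringBuilder();
        NodeList children = paragraph.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && W_NS.equals(child.getNamespaceURI())
                    && "r".equals(child.getLocalName())) {
                sb.append(child.getTextContent());
            }
        }
        return sb.toString();
    }

    // Collects the text of all descendants, which is comparable to what
    // reading the text value at the paragraph element gives.
    static String textOfAllDescendants(Element paragraph) {
        return paragraph.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Element p = parseParagraph(PARAGRAPH_XML);
        System.out.println("[" + textOfDirectRuns(p) + "]");     // prints []
        System.out.println("[" + textOfAllDescendants(p) + "]"); // prints [shalom]
    }
}
```

The first result is empty because the paragraph has no run as a direct child, which matches the isBlank() symptom in the question. The second walks the whole subtree and therefore finds the text inside w:bdo.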