java, apache-poi, arabic, arabic-support

Apache POI Mirroring Words in Arabic Language


I'm developing an Arabic OCR application in Java that extracts Arabic text from images and then saves the text to a Microsoft Word file. For this purpose I use the Apache POI library.

My problem is that the word order is fine in the extracted text, but when I save it to a Word file the words come out in the wrong order and look mirrored.

For example, the extracted text:

But after saving it as a Word file:

Here is the code that saves the Word file:

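import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

import org.apache.poi.xwpf.usermodel.ParagraphAlignment;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class SavingStringAsWordDoc {

    // theGUI and TesseractPerformer are defined elsewhere in the application
    File f = theGUI.toBeSavedWord;

    public void saveAsWorddd() {
        String st = TesseractPerformer.toBeShown;

        try (FileOutputStream fout = new FileOutputStream(f); XWPFDocument docfile = new XWPFDocument()) {

            XWPFParagraph paraTit = docfile.createParagraph();
            paraTit.setAlignment(ParagraphAlignment.LEFT);
            XWPFRun paraTitRun = paraTit.createRun();
            paraTitRun.setBold(true);
            paraTitRun.setFontSize(15);
            paraTit.setAlignment(ParagraphAlignment.RIGHT);
            docfile.createParagraph().createRun().setText(st); // content to be written
            docfile.write(fout); // adding to output stream
        } catch (IOException e) {
            System.out.println("IO ERROR:" + e);
        }
    }
}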

I noticed one thing that might help in understanding the problem: if I copy the garbled text from the Word file and paste it back using the "Keep Text Only" paste option, the word order of the paragraph is fixed.


Solution

  • This needs bidirectional (bidi) text direction support, which XWPF in Apache POI does not yet provide by default. But the underlying object org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr supports it. So we must get this underlying object from the XWPFParagraph and then turn Bidi on.

    Example:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    
    import org.apache.poi.xwpf.usermodel.*;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;
    
    public class CreateWord {
    
     public static void main(String[] args) throws Exception {
    
      String content = Files.readString(new File("ArabicTextFile.txt").toPath(), StandardCharsets.UTF_16);
    
      XWPFDocument document = new XWPFDocument();
    
      XWPFParagraph paragraph = document.createParagraph();
    
      // set bidirectional text support on
      CTP ctp = paragraph.getCTP();
      CTPPr ctppr = ctp.getPPr();
      if (ctppr == null) ctppr = ctp.addNewPPr();
      ctppr.addNewBidi().setVal(STOnOff.ON);
    
      XWPFRun run=paragraph.createRun(); 
      run.setBold(true);
      run.setFontSize(22);
      run.setText(content);
    
      FileOutputStream out = new FileOutputStream("CreateWord.docx");
      document.write(out);
      out.close();
      document.close();
    
     }
    }
    

    My ArabicTextFile.txt contains the text

    هذا هو النص باللغة العربية لاختبار النص باللغة العربية

    in UTF-16 encoding (Unicode).

    Result in Word: [screenshot]
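
If the OCR output can be either Arabic or Latin text, one option is to check the base direction of the string before turning bidi on for the paragraph. The following is a minimal sketch that uses only the JDK's java.text.Bidi class. The class name BidiCheck and the helper isRightToLeft are hypothetical names chosen for illustration and are not part of Apache POI.

```java
import java.text.Bidi;

final class BidiCheck {

    // Returns true when the base direction of the text resolves to right-to-left.
    // With DIRECTION_DEFAULT_LEFT_TO_RIGHT the base direction is taken from the
    // first strongly directional character and falls back to left-to-right
    // when the text contains none.
    static boolean isRightToLeft(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    public static void main(String[] args) {
        // Arabic sample written with Unicode escapes so the source file
        // compiles regardless of its encoding.
        String arabic = "\u0647\u0630\u0627 \u0647\u0648";
        System.out.println(isRightToLeft(arabic));          // prints "true"
        System.out.println(isRightToLeft("plain English")); // prints "false"
    }
}
```

When the check returns true, the paragraph can be given the Bidi setting exactly as in the CreateWord example above. Otherwise the paragraph can be left with its default left-to-right direction.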