I'm developing an Arabic OCR application in Java that extracts Arabic text from images and then saves the text to a Microsoft Word file. For this purpose I use the Apache POI library.
My problem is that when I extract some text, the order of the words is fine, but when I save it to a Word file, the order of the words gets messed up and looks mirrored.
But after saving it as a Word file:
And here is the code for saving the Word file:
public class SavingStringAsWordDoc {
    File f = theGUI.toBeSavedWord;

    public void saveAsWorddd() {
        String st = TesseractPerformer.toBeShown;
        try (FileOutputStream fout = new FileOutputStream(f); XWPFDocument docfile = new XWPFDocument()) {
            XWPFParagraph paraTit = docfile.createParagraph();
            paraTit.setAlignment(ParagraphAlignment.LEFT);
            XWPFRun paraTitRun = paraTit.createRun();
            paraTitRun.setBold(true);
            paraTitRun.setFontSize(15);
            paraTit.setAlignment(ParagraphAlignment.RIGHT);
            docfile.createParagraph().createRun().setText(st); // content to be written
            docfile.write(fout); // adding to output stream
        } catch (IOException e) {
            System.out.println("IO ERROR:" + e);
        }
    }
}
I noticed one thing that might help in understanding the problem:
if I copy the messed-up text in the Word file and then paste it using the "Keep Text Only" paste option, the order of the paragraph is fixed.
This needs bidirectional text direction support (bidi), which is not yet implemented in XWPF of Apache POI by default. But the underlying object org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr supports it. So we must get this underlying object from the XWPFParagraph and then set Bidi on.
Example:
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.poi.xwpf.usermodel.*;

import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTPPr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STOnOff;

public class CreateWord {

    public static void main(String[] args) throws Exception {

        // read the UTF-16 encoded sample text (Files.readString needs Java 11 or later)
        String content = Files.readString(new File("ArabicTextFile.txt").toPath(), StandardCharsets.UTF_16);

        XWPFDocument document = new XWPFDocument();
        XWPFParagraph paragraph = document.createParagraph();

        // set bidirectional text support on
        CTP ctp = paragraph.getCTP();
        CTPPr ctppr = ctp.getPPr();
        if (ctppr == null) ctppr = ctp.addNewPPr();
        ctppr.addNewBidi().setVal(STOnOff.ON);

        XWPFRun run = paragraph.createRun();
        run.setBold(true);
        run.setFontSize(22);
        run.setText(content);

        FileOutputStream out = new FileOutputStream("CreateWord.docx");
        document.write(out);
        out.close();
        document.close();
    }
}
My ArabicTextFile.txt contains the text
هذا هو النص باللغة العربية لاختبار النص باللغة العربية
in UTF-16 encoding (Unicode).
Result in Word:
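Setting the flag on every paragraph may not be desirable if the OCR output can also contain purely left-to-right text. As a rough sketch of how one might decide per paragraph whether to switch Bidi on, the JDK's java.text.Bidi class can report the base direction of a string. The class name BidiCheck and the helper needsRtlParagraph are hypothetical names chosen for illustration; they are not part of Apache POI.

```java
import java.text.Bidi;

class BidiCheck {

    // Hypothetical helper: returns true when the base direction of the text,
    // as resolved by the Unicode bidi algorithm (first strong character),
    // is right-to-left. Null and empty strings are treated as left-to-right.
    static boolean needsRtlParagraph(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        // DIRECTION_DEFAULT_LEFT_TO_RIGHT: derive the base direction from the
        // text itself and fall back to left-to-right if it has no strong characters.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    public static void main(String[] args) {
        // "\u0647\u0630\u0627 \u0647\u0648" is the Arabic text "هذا هو"
        System.out.println(needsRtlParagraph("\u0647\u0630\u0627 \u0647\u0648"));
        System.out.println(needsRtlParagraph("plain text"));
    }
}
```

In the CreateWord example, the call ctppr.addNewBidi().setVal(STOnOff.ON) could then be wrapped in if (BidiCheck.needsRtlParagraph(content)) { ... }, so that only paragraphs starting with right-to-left text get the flag.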