java character-encoding java-8 rtf

Can't read an ANSI RTF file that contains Arabic characters


I have RTF files that are encoded in ANSI but contain Arabic phrases. I'm trying to read such a file, but I can't get the text out in the right encoding.

RTF file:

{\rtf1\fbidis\ansi\deff0{\fonttbl{\f0\fnil\fcharset178 MS Sans Serif;}{\f1\fnil\fcharset0 MS Sans Serif;}}

\viewkind4\uc1\pard\ltrpar\lang12289\f0\rtlch\fs16\'ca\'d1\'cc\'e3\'c9: \'d3\'e3\'ed\'d1 \'c7\'e1\'e3\'cc\'d0\'e6\'c8\f1\ltrch\par

}

My Java code is:

RTFEditorKit rtf = new RTFEditorKit();
Document doc = rtf.createDefaultDocument();
rtf.read(new InputStreamReader(new FileInputStream("Document.rtf"), "windows-1256"),doc,0);
System.out.println(doc.getText(0,doc.getLength()));

The incorrect output is:

ÊÑÌãÉ: ÓãíÑ ÇáãÌÐæÈ

Solution

  • Try RTFParserKit; it should correctly handle encodings like the one you describe.

    Here is the text it extracted from your example:

    ترجمة: سمير المجذوب

    I used the RtfDump class, which ships with RTFParserKit, to dump the RTF content into an XML file. The class runs StandardRtfParser on the supplied input file, while the RtfDumpListener class receives the events the parser raises as it reads the file and adds content to the XML file as it goes.
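
For anyone wondering why the original attempt fails, here is a minimal, JDK-only sketch of what a parser has to do with this file. The Arabic text is stored as ASCII hex escapes (\'ca, \'d1, ...), so the charset passed to InputStreamReader has no effect on them; each escape is a raw byte that must be decoded with the code page implied by the font's \fcharset178 (Arabic, i.e. windows-1256). The wrong output in the question is consistent with those same bytes being decoded as windows-1252 instead. The class name RtfHexDecode and the method decodeHexEscapes are hypothetical names for illustration only; this is not how RTFParserKit is implemented, and it ignores control words, groups, and \u escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of RTFParserKit or the JDK.
public class RtfHexDecode {
    // One or more consecutive RTF hex escapes, e.g. \'ca\'d1\'cc
    private static final Pattern HEX_RUN =
            Pattern.compile("(?:\\\\'[0-9a-fA-F]{2})+");

    // Replaces each run of \'xx escapes with the text obtained by decoding
    // the corresponding bytes in the named charset. Everything else is
    // copied through unchanged (control words are NOT interpreted).
    static String decodeHexEscapes(String rtfText, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        Matcher m = HEX_RUN.matcher(rtfText);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(rtfText, last, m.start());
            String run = m.group();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Each escape is exactly 4 chars: backslash, apostrophe, 2 hex digits
            for (int i = 0; i < run.length(); i += 4) {
                bytes.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
            }
            out.append(new String(bytes.toByteArray(), charset));
            last = m.end();
        }
        out.append(rtfText, last, rtfText.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped text run from the RTF file in the question
        String fragment = "\\'ca\\'d1\\'cc\\'e3\\'c9: \\'d3\\'e3\\'ed\\'d1 "
                + "\\'c7\\'e1\\'e3\\'cc\\'d0\\'e6\\'c8";
        // Decoded with the Arabic code page: the intended text
        System.out.println(decodeHexEscapes(fragment, "windows-1256"));
        // Decoded with the Western code page: the mojibake from the question
        System.out.println(decodeHexEscapes(fragment, "windows-1252"));
    }
}
```

How the two strings display depends on your console encoding, but the first call returns the Arabic text shown above and the second returns the Latin-looking string from the question. For real documents a proper parser is still the better choice, since fonts (and therefore code pages) can change within a file.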