Tags: java, character-encoding, right-to-left, google-cast, bidi

Converting from windows-1256 to UTF-8 causes punctuation issue


I have an Arabic subtitle file that I'm trying to convert from SRT to VTT. The subtitles seem to use windows-1256, according to the ICU character encoding detector (Java). The final VTT file is in UTF-8.

The subtitle converts fine and everything looks right, except that the punctuation moves from the left side to the right side. I am using this subtitle on a Chromecast, so at first I thought it was a Chromecast issue, but even gedit on Linux shows the problem. LibreOffice does not, and neither does the console output in IntelliJ.

I wrote a simple piece of code to recreate the issue without actually converting from SRT to VTT, just by converting from windows-1256 to UTF-8.

import java.io.*;

// Read the original subtitle as windows-1256 and write it back out as UTF-8.
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream("arabic sub.srt"), "windows-1256")
);
BufferedWriter writer = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream("bad punctuation.srt"), "UTF-8")
);
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
    writer.write(line);
    writer.write("\r\n");
}
writer.close();
reader.close();

// Read the UTF-8 copy back in and print it again.
reader = new BufferedReader(
    new InputStreamReader(new FileInputStream("bad punctuation.srt"), "UTF-8")
);
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
reader.close();

Here is the output from the IntelliJ console:

[Image: IntelliJ console output]

As you can see, the period is on the left side, which I guess is correct.

Here is what gedit shows:

[Image: gedit]

Most of the text is on the right, which I guess is correct, but the period is also on the right, which I guess is wrong.

Here is LibreOffice:

[Image: LibreOffice]

This is mostly correct: the punctuation is on the left. However, the text is also on the left, and I guess it should be on the right.

This is the subtitle I'm testing: https://www.opensubtitles.org/en/subtitles/5168225/game-of-thrones-fire-and-blood-ar

I also tried a different SRT that was originally encoded as UTF-8, and that one worked fine. So my guess is that the conversion from windows-1256 is the issue.

So what is the issue with the way I'm re-encoding the file?

Thanks.

Edit: I forgot the Chromecast picture.

[Image: Chromecast]

As you can see the punctuation is on the wrong side.

EDIT: I just noticed that chardet on Linux reports MacCyrillic, not windows-1256, while the Java ICU library reports windows-1256. If I use MacCyrillic, the punctuation looks fine in gedit, but the text itself turns into garbage characters.


Solution

  • Looking at the original subtitle file, I can tell for sure that it is badly formatted. The full stops appear before the text even when the file is displayed with a left-to-right character set, which means they are stored at the start of each line. I believe the correct character set is windows-1256, though.

    The only way this would display correctly is if the punctuation at the beginning of the line is displayed LTR while the rest of the line is displayed RTL. You could try to force this by adding a Unicode left-to-right mark (U+200E) right after the punctuation.

    If you prefer to fix the original file instead, you would need to move any punctuation from the beginning of each line to the end. The brackets at the beginning of the line would also need to be reversed.
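Both suggestions can be sketched in Java as below. This is a minimal sketch, not a tested fix for this particular file. The class name `SubtitleBidiFix`, its method names, and the set of characters treated as punctuation are assumptions made for illustration. The left-to-right mark variant can only help where the renderer picks a line's base direction from its first strong character, which is the default rule of the Unicode Bidirectional Algorithm. Brackets are deliberately left untouched here, because how they should be reversed depends on how the file stores them.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubtitleBidiFix {
    /** U+200E LEFT-TO-RIGHT MARK: invisible, but a strong left-to-right character. */
    static final String LRM = "\u200E";

    // Group 1: leading run of sentence punctuation (ASCII plus the Arabic comma,
    // semicolon and question mark). Group 2: optional whitespace. Group 3: the rest.
    // The punctuation set is an assumption; adjust it to what the file contains.
    private static final Pattern LEADING_PUNCTUATION =
            Pattern.compile("([.,!?:;\u060C\u061B\u061F]+)(\\s*)(.*)");

    /** Option 1: keep the punctuation first, but insert an LRM right after it. */
    static String addLrmAfterLeadingPunctuation(String line) {
        Matcher m = LEADING_PUNCTUATION.matcher(line);
        if (!m.matches() || m.group(3).isEmpty()) {
            return line; // no leading punctuation, or nothing but punctuation
        }
        return m.group(1) + LRM + m.group(2) + m.group(3);
    }

    /** Option 2: move the leading punctuation run to the end of the line. */
    static String moveLeadingPunctuationToEnd(String line) {
        Matcher m = LEADING_PUNCTUATION.matcher(line);
        if (!m.matches() || m.group(3).isEmpty()) {
            return line; // cue numbers, timestamps and blank lines pass through
        }
        return m.group(3) + m.group(1);
    }

    public static void main(String[] args) {
        // A period stored before an Arabic word, as in the badly formatted file.
        String sample = "." + "\u0645\u0631\u062D\u0628\u0627";
        // Print code points rather than glyphs so the stored order is unambiguous.
        for (String s : new String[] {
                addLrmAfterLeadingPunctuation(sample),
                moveLeadingPunctuationToEnd(sample)}) {
            StringBuilder sb = new StringBuilder();
            s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
            System.out.println(sb.toString().trim());
        }
    }
}
```

Either method could be applied to each `line` in the copy loop from the question before it is written out. SRT cue numbers and timestamp lines start with a digit, so they do not match the pattern and are returned unchanged.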