Tags: utf-8, character-encoding, iconv, mojibake, srt

Japanese SRT files garbled, can't determine encoding to fix with iconv


I have an srt file, excerpt:

2
00:00:36,208 --> 00:00:39,667
Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!

3
00:00:57,917 --> 00:01:00,917
Ãéáôß ôñÝ÷åéò, ÃïõÜéíôæåëóôéí;
Óïõ ðÞñá äþñï ãåíåèëßùí.

4
00:01:00,958 --> 00:01:03,208
Äåí ðåéñÜæåé, äåí ÷ñåéáæüôáí
íá ìïõ ðÜñåéò êÜôé.

5
00:01:03,250 --> 00:01:06,375
Óïõ ðÞñá ëßãï êïñìü äÝíôñïõ.
Êáé èá ôï öáò.

6
00:01:06,417 --> 00:01:08,875
Ùñáßá. ¸ôóé êé áëëéþò
èá Ýôñùãá êïñìü.

7
00:01:08,917 --> 00:01:10,208
Äåí èá Ýôñùãåò.

8
00:01:10,208 --> 00:01:11,000
Íáé. ÂëÝðåéò...

9
00:01:11,000 --> 00:01:12,417
...üëá ôá ðñÜãìáôá ðïõ Þèåëåò
íá ìïõ êÜíåéò...

10
00:01:12,417 --> 00:01:13,958
...ó÷åäßáæá íá ôá êÜíù ìüíïò ìïõ.

Supposedly these are Japanese subtitles, but the text is obviously garbled by an encoding problem. I am trying to work out how to correct it and ultimately convert it to UTF-8. Does anyone have any ideas?

Output of the file command: UTF-8 Unicode (with BOM) text, with CRLF line terminators

File can be obtained here for testing: http://www.opensubtitles.org/en/subtitles/5040215/the-incredible-burt-wonderstone-ja


Solution

  • What you have is a document whose source was encoded in the ISO-8859-7 character set, but which was misread as ISO-8859-1 and transcoded from that to the UTF-8 encoding scheme. After the transcoding to UTF-8, a U+FEFF byte order mark (BOM) and a few curly quotation marks (U+201C, U+201D) were added.

    The language is Greek, and the 2nd subtitle sequence, when corrected, is:

    2
    00:00:36,208 --> 00:00:39,667
    Θα σε σκοτώσω, Γουάιντζελστιν!
    

    The English translation is "I'll kill you, Gouaintzelstin!".
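
    The corruption path can be reproduced on this one line. The following is a minimal Python sketch, assuming Python's built-in iso-8859-7 and iso-8859-1 codecs; the variable names are illustrative only and say nothing about which tool actually mangled the file:

```python
# Start from the corrected Greek line shown above.
original = "Θα σε σκοτώσω, Γουάιντζελστιν!"

# The source file stored the Greek text as ISO-8859-7 bytes.
greek_bytes = original.encode("iso-8859-7")

# Some tool then misread those bytes as ISO-8859-1 ...
misread = greek_bytes.decode("iso-8859-1")

# ... and saved the result as UTF-8 with a byte order mark in front.
garbled_file = b"\xef\xbb\xbf" + misread.encode("utf-8")

print(misread)
```

    The printed string is the garbled text of subtitle 2 in the question's excerpt, which is what confirms the ISO-8859-7 / ISO-8859-1 mix-up.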

    To reverse/correct it:

    1. Decode the document from the UTF-8 encoding scheme
    2. Remove all code points greater than U+00FF (this drops the BOM and the added quotation marks)
    3. Encode the document using the ISO-8859-1 encoding, which recovers the original octets
    4. Decode those octets using the ISO-8859-7 encoding, then encode the result using the UTF-8 encoding scheme.

    An implementation of the above in Perl:

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use Encode qw[];
    
    (@ARGV == 1 && -f $ARGV[0])
      or die qq[Usage: $0 <file>];
    
    my $file = shift @ARGV;
    
    my ($octets, $string);
    
    # Read all the octets from the file
    $octets = do {
        open my $fh, '<:raw', $file
          or die qq[Could not open '$file' for reading: '$!'];
        local $/; <$fh>
    };
    
    # Decode the octets using the UTF-8 encoding scheme
    $string = Encode::decode('UTF-8', $octets, Encode::FB_CROAK);
    
    # Remove all code points greater than U+00FF
    $string =~ s/[^\x00-\xFF]//g; 
    
    # Encode the string using the ISO-8859-1 encoding
    $octets = Encode::encode('ISO-8859-1', $string);
    
    # Decode the octets using the ISO-8859-7 encoding
    $string = Encode::decode('ISO-8859-7', $octets);
    
    # Encode the string using the UTF-8 encoding
    $octets = Encode::encode('UTF-8', $string);
    
    # Output the octets on standard output (raw, so no newline translation)
    binmode STDOUT;
    print $octets;
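
    For readers without Perl, the same four steps can be sketched in Python. This is a rough equivalent rather than a port: the function name fix_mojibake is hypothetical, and errors="replace" is an assumption chosen so that any byte with no ISO-8859-7 mapping becomes U+FFFD instead of raising an exception:

```python
def fix_mojibake(octets: bytes) -> bytes:
    """Undo an ISO-8859-7 -> (misread as ISO-8859-1) -> UTF-8 round trip."""
    # 1. Decode the octets using the UTF-8 encoding scheme
    text = octets.decode("utf-8")
    # 2. Remove all code points greater than U+00FF (BOM, curly quotes)
    text = "".join(ch for ch in text if ord(ch) <= 0xFF)
    # 3. Encode the string using ISO-8859-1 to recover the original bytes
    raw = text.encode("iso-8859-1")
    # 4. Decode those bytes as ISO-8859-7 and re-encode as UTF-8
    return raw.decode("iso-8859-7", errors="replace").encode("utf-8")


# One garbled line from the question, with a BOM and CRLF as in the file
sample = "\ufeffÈá óå óêïôþóù, ÃïõÜéíôæåëóôéí!\r\n".encode("utf-8")
print(fix_mojibake(sample).decode("utf-8"))
```

    To fix a whole file, read it in binary mode, pass the bytes to fix_mojibake, and write the result in binary mode so the CRLF line terminators are left untouched.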