Unreadable symbols instead of Russian text in ISO-8859-1 file


I have issues with encoding.

I have downloaded many albums, and as far as I can see, the whole archive was made in a Windows environment: the .cue files are in ISO-8859-1 encoding, and the Cyrillic (Russian) text in them is unreadable. Example:

REM GENRE Rock
REM DATE 1987
REM DISCID B407700D
REM COMMENT "ExactAudioCopy v0.99pb4"
PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
TITLE "Èãðà â áèñåð ïåðåä ñâèíüÿìè [ÕÎÐ]"
FILE "Ãðàæäàíñêàÿ Îáîðîíà - Èãðà â áèñåð ïåðåä ñâèíüÿìè [ÕÎÐ].flac" WAVE
  TRACK 01 AUDIO
    TITLE "Èãðà â áèñåð"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 01 00:00:00
  TRACK 02 AUDIO
    TITLE "Íà íàøèõ ãëàçàõ"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 02:22:06
    INDEX 01 02:24:59
  TRACK 03 AUDIO
    TITLE "×óæåðîäíûì ýëåìåíòîì (÷àñòèöåé ëæè)"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 04:31:69
    INDEX 01 04:34:27
  TRACK 04 AUDIO
    TITLE "ß èëëþçîðåí"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 07:30:40
    INDEX 01 07:32:61
  TRACK 05 AUDIO
    TITLE "Äåòñêèé Ìèð"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 10:58:50
    INDEX 01 11:00:12
  TRACK 06 AUDIO
    TITLE "Çîîïàðê"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 12:54:46
    INDEX 01 12:58:08
  TRACK 07 AUDIO
    TITLE "ÊÁÃ-Ðîê (Ðîê-ÊÁÃ)"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 15:41:48
    INDEX 01 15:43:39
  TRACK 08 AUDIO
    TITLE "Ñêîðî íàñòàíåò ñîâñåì"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 17:57:23
    INDEX 01 17:59:51
  TRACK 09 AUDIO
    TITLE "Íåíàâèæó êðàñíûé öâåò"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 20:59:09
    INDEX 01 21:01:20
  TRACK 10 AUDIO
    TITLE "Îí óâèäåë ñîëíöå"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 23:07:31
    INDEX 01 23:09:45
  TRACK 11 AUDIO
    TITLE "Îïòèìèçì"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 25:48:46
    INDEX 01 25:51:13
  TRACK 12 AUDIO
    TITLE "Ìàìà, ìàìà..."
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 27:53:58
    INDEX 01 27:56:03
  TRACK 13 AUDIO
    TITLE "Óáèéöà"
    PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"
    INDEX 00 30:34:19
    INDEX 01 30:36:12

Note that this is the file after I had already converted it from ISO-8859-1 to UTF-8; the original version showed squares with question marks (?) instead. How can I make this gibberish readable?


Solution

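  • Your file isn't ISO-8859-1 encoded. I'd guess it's cp1251 (Windows-1251). This is a case of mojibake: the cp1251 bytes were decoded as ISO-8859-1, so re-encoding the garbled text as ISO-8859-1 and decoding it as cp1251 recovers the original (example in Python, since it is widely readable):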

    'PERFORMER "Ãðàæäàíñêàÿ Îáîðîíà"'.encode('iso-8859-1').decode('cp1251')
    
    'PERFORMER "Гражданская Оборона"'
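
The same round trip can be applied to a whole file. The following is a minimal sketch, assuming the cp1251 guess above is right; the function names (`fix_mojibake`, `repair_cue_file`, `recode_original_cue_file`) and any paths are hypothetical and only serve as illustration. It covers both situations from the question: a file that was already converted to UTF-8 from the wrong source encoding, and an untouched original.

```python
from pathlib import Path


def fix_mojibake(text: str) -> str:
    """Undo cp1251 bytes that were mistakenly decoded as ISO-8859-1."""
    # Encoding back to ISO-8859-1 recovers the original bytes, which are
    # then decoded with cp1251 (an assumption, not a detected encoding).
    return text.encode("iso-8859-1").decode("cp1251")


def repair_cue_file(src: Path, dst: Path) -> None:
    """Repair a .cue file that was already (wrongly) converted to UTF-8."""
    # utf-8-sig also drops a leading byte order mark, which could not be
    # encoded as ISO-8859-1 in fix_mojibake.
    garbled = src.read_text(encoding="utf-8-sig")
    dst.write_text(fix_mojibake(garbled), encoding="utf-8")


def recode_original_cue_file(src: Path, dst: Path) -> None:
    """Convert an untouched .cue file from cp1251 straight to UTF-8."""
    dst.write_text(src.read_text(encoding="cp1251"), encoding="utf-8")
```

If the text contains a character that has no ISO-8859-1 equivalent, `fix_mojibake` raises `UnicodeEncodeError`, which would suggest the file went through a different wrong encoding than the one assumed here. Writing to a separate destination file keeps the original intact in case the guess is wrong. Note also that the `FILE` line is repaired along with the rest, so the name of the .flac file on disk has to match the corrected name.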