python ubuntu-18.04 vlc codec

Convert Windows-1252 subtitle file to utf-8


I am downloading Serbian/Bosnian/Croatian subtitles via VLC player on an Ubuntu machine, and I constantly have to manually change characters such as æ, è, and ð into ć, č, and đ so that the player renders them properly. I wanted to write a python3 function that does that for me, but I got lost trying to understand string encoding and decoding.

Through chardet.detect I found that the encoding of the .srt files that VLC player downloads is Windows-1252. So right now, I do something like this:

import codecs

f = codecs.open('my_file.srt', 'r', encoding='Windows-1252')
data = f.read()
data_utf8 = data.encode('utf-8')
f.close()

The thing is, when I print the content of the data variable to the terminal, I might get a fragment like this: obožavam vaše. But when I print the content of the data_utf8 variable, that same fragment looks like this: obo\xc5\xbeavam va\xc5\xa1e. This is not what I expected.

Furthermore, when I now want to save this data to a file

with open('my_utf8_file.srt', 'w') as f:
    f.write(data_utf8)

I get TypeError: write() argument must be str, not bytes.

Can anyone tell me what I am doing wrong?


Solution

  • You have to use:

    with open('my_utf8_file.srt', 'wb') as f:
        f.write(data_utf8)
    

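    Note the 'b': it opens the file in binary mode, so you can write bytes, which is what .encode() returns. That is also why data_utf8 prints differently: printing a bytes object shows its escaped representation (\xc5\xbe is the UTF-8 encoding of ž) rather than the decoded characters.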

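    Alternatively, open the output file in text mode with an explicit encoding and write the str directly, so that Python does the encoding for you: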

    with open('my_utf8_file.srt', 'w', encoding='utf-8') as f:
        f.write(data)
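
Putting the two steps together, the conversion might be sketched as below. The function name convert_to_utf8 and its source_encoding parameter are hypothetical, introduced only for illustration. One further assumption worth testing: the bytes that Windows-1252 shows as æ, è, and ð are the same bytes that Windows-1250 (the Central European code page) maps to ć, č, and đ, so the detected encoding may be a near miss, and decoding as cp1250 could remove the need for manual replacement.

```python
def convert_to_utf8(src, dst, source_encoding="cp1252"):
    """Read src using source_encoding and write the same text to dst as UTF-8.

    Hypothetical helper; source_encoding is whatever encoding the file is
    believed to be in (an assumption the caller has to supply).
    """
    # newline="" keeps the original line endings (.srt files often use \r\n).
    with open(src, "r", encoding=source_encoding, newline="") as f:
        text = f.read()          # bytes on disk -> str in memory
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)            # str in memory -> UTF-8 bytes on disk
    return text


# The same three bytes read differently under the two code pages:
raw = bytes([0xE6, 0xE8, 0xF0])
print(raw.decode("cp1252"))  # æèð
print(raw.decode("cp1250"))  # ćčđ
```

If the cp1250 assumption holds for your files, a call such as convert_to_utf8('my_file.srt', 'my_utf8_file.srt', source_encoding='cp1250') should produce a UTF-8 file with the correct letters; if it does not, keep the default and treat this only as a starting point.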