I'm reading mojibaked ID3 tags with mutagen. My goal is to fix the mojibake while learning about encodings and Python's handling thereof.
The file I'm working with has an ID3v2 tag, and I'm looking at its album (TALB) frame, which is, according to the encoding byte in the TALB ID3 frame, encoded in Latin-1 (ISO-8859-1). I know that the bytes in this frame, however, are encoded in cp1251 (Cyrillic).
Here's my code so far:
>>> from mutagen.mp3 import MP3
>>> mp3 = MP3(paths[0])
>>> mp3['TALB']
TALB(encoding=0, text=[u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'])
Now, as you can see, mp3['TALB'].text[0] is represented here as a Unicode string. However, it's mojibaked:
>>> print mp3['TALB'].text[0]
Áóðæóéñêèå ïëÿñêè
I am having very little luck transcoding these cp1251 bytes into their correct Unicode code points. My best results so far have been very unbecoming:
>>> st = ''.join([chr(ord(x)) for x in mp3['TALB'].text[0]]); st
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Буржуйские пляски <-- **this is the correct, demojibaked text!**
As I understand this approach, it works because I end up transforming the Unicode string into an 8-bit string, which I can then decode into Unicode, while specifying the encoding I am decoding from.
The problem is that I can't call decode('cp1251') on the Unicode string directly:
>>> st = mp3['TALB'].text[0]; st
u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> print st.decode('cp1251')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/dmitry/dev/mp3_tag_encode_convert/lib/python2.7/encodings/cp1251.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-9: ordinal not in range(128)
Can someone explain this? I can't understand how to make it not decode into the 7-bit ascii range when operating directly on the u'' string.
First, encode it with the encoding it was (wrongly) decoded from, Latin-1, to get the original bytes back.
>>> tag = u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
>>> raw = tag.encode('latin-1'); raw
'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
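This works because Latin-1 maps the code points U+0000 through U+00FF one-to-one onto the byte values 0x00 through 0xFF, so encoding with it undoes the wrong decode exactly. It is the same thing the chr(ord(x)) join in the question does by hand. A small sketch to check that claim (the variable names are only for illustration, and it is written to run on Python 2 or 3):

```python
# A shortened sample of the mojibaked tag text from the question.
tag = u'\xc1\xf3\xf0\xe6'

# Encoding as Latin-1 turns each character into the byte with the same value.
raw = tag.encode('latin-1')

code_points = [ord(ch) for ch in tag]
byte_values = list(bytearray(raw))  # bytearray yields ints on Python 2 and 3

# Every byte equals the code point of the character it came from.
assert code_points == byte_values

# Decoding with Latin-1 again gives back the same text, so nothing is lost.
assert raw.decode('latin-1') == tag
```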
Then you can decode those bytes with the proper encoding.
>>> fixed = raw.decode('cp1251'); print fixed
Буржуйские пляски
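As for the traceback: in Python 2, calling decode() on a unicode object first has to turn it into a byte string, and it does that implicitly with the default ascii codec. That hidden encode step is what raises the UnicodeEncodeError, before cp1251 is ever consulted. Doing the encode explicitly with Latin-1 avoids it. If you have many tags to fix, the two steps could be wrapped up roughly like this. Note that fix_mojibake and its parameter names are made up for this sketch and are not part of mutagen, and that returning the input unchanged on failure is just one possible policy:

```python
def fix_mojibake(text, wrong='latin-1', right='cp1251'):
    """Undo a decode done with `wrong` and redo it with `right`.

    Hypothetical helper, not part of mutagen. Returns the input unchanged
    if it cannot be round-tripped through the two codecs.
    """
    try:
        return text.encode(wrong).decode(right)
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return text


tag = u'\xc1\xf3\xf0\xe6\xf3\xe9\xf1\xea\xe8\xe5 \xef\xeb\xff\xf1\xea\xe8'
fixed = fix_mojibake(tag)
```

Text that already contains real Cyrillic characters cannot be encoded as Latin-1, so with this policy it passes through untouched rather than being damaged.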