I have a supposedly unicode string like this:
u'\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xaf\xc2\xbc\xc2\x81\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xc2\xbc\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xe2\u20ac\u0161\xc2\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\xb0\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xc6\u2019\xe2\u20ac\xa0\xc3\xa3\xe2\u20ac\u0161\xc2\xa3\xc3\xa3\xc6\u2019\xc2\x90\xc3\xa3\xc6\u2019\xc2\xab\xc3\xaf\xc2\xbc\xcb\u2020\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xaf\xc2\xbc\xe2\u20ac\xb0'
How do I get the correct unicode string out of this? I think, the actual unicode value is ラブライブ!スクールアイドルフェスティバル(スクフェス)
You have a Mojibake, an incorrectly decoded piece text.
You can use the ftfy
library to un-do the damage:
>>> from ftfy import fix_text
>>> fix_text(s)
u'\u30e9\u30d6\u30e9\u30a4\u30d6!\u30b9\u30af\u30fc\u30eb\u30a2\u30a4\u30c9\u30eb\u30d5\u30a7\u30b9\u30c6\u30a3\u30d0\u30eb(\u30b9\u30af\u30d5\u30a7\u30b9)'
>>> print fix_text(s)
ラブライブ!スクールアイドルフェスティバル(スクフェス)
According to ftfy
, your data was encoded as UTF-8, then decoded as Windows codepage 1252; the ftfy.fixes.fix_one_step_and_explain()
function shows the repair steps needed:
>>> ftfy.fixes.fix_one_step_and_explain(s)[-1]
[(u'encode', u'sloppy-windows-1252', 0), (u'decode', u'utf-8', 0)]
(the 'sloppy' encoding is needed because not all UTF-8 bytes can be decoded as cp1252
, but some bad decoders then just copy the original byte; the special codec reverses that process).
In fact, in your case this was done twice, not a feat I had seen before:
>>> print s.encode('sloppy-cp1252').decode('utf8').encode('sloppy-cp1252').decode('utf8')
ラブライブ!スクールアイドルフェスティバル(スクフェス)