Search code examples
pythonunicodeencodingmojibake

Python - convert unicode and hex to unicode


I have a supposedly unicode string like this:

u'\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xa3\xc6\u2019\xc2\xa9\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\u201c\xc3\xaf\xc2\xbc\xc2\x81\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xc2\xbc\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xe2\u20ac\u0161\xc2\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa4\xc3\xa3\xc6\u2019\xe2\u20ac\xb0\xc3\xa3\xc6\u2019\xc2\xab\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xc6\u2019\xe2\u20ac\xa0\xc3\xa3\xe2\u20ac\u0161\xc2\xa3\xc3\xa3\xc6\u2019\xc2\x90\xc3\xa3\xc6\u2019\xc2\xab\xc3\xaf\xc2\xbc\xcb\u2020\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xa3\xe2\u20ac\u0161\xc2\xaf\xc3\xa3\xc6\u2019\xe2\u20ac\xa2\xc3\xa3\xe2\u20ac\u0161\xc2\xa7\xc3\xa3\xe2\u20ac\u0161\xc2\xb9\xc3\xaf\xc2\xbc\xe2\u20ac\xb0'

How do I get the correct unicode string out of this? I think, the actual unicode value is ラブライブ!スクールアイドルフェスティバル(スクフェス)


Solution

  • You have a Mojibake, an incorrectly decoded piece text.

    You can use the ftfy library to un-do the damage:

    >>> from ftfy import fix_text
    >>> fix_text(s)
    u'\u30e9\u30d6\u30e9\u30a4\u30d6!\u30b9\u30af\u30fc\u30eb\u30a2\u30a4\u30c9\u30eb\u30d5\u30a7\u30b9\u30c6\u30a3\u30d0\u30eb(\u30b9\u30af\u30d5\u30a7\u30b9)'
    >>> print fix_text(s)
    ラブライブ!スクールアイドルフェスティバル(スクフェス)
    

    According to ftfy, your data was encoded as UTF-8, then decoded as Windows codepage 1252; the ftfy.fixes.fix_one_step_and_explain() function shows the repair steps needed:

    >>> ftfy.fixes.fix_one_step_and_explain(s)[-1]
    [(u'encode', u'sloppy-windows-1252', 0), (u'decode', u'utf-8', 0)]
    

    (the 'sloppy' encoding is needed because not all UTF-8 bytes can be decoded as cp1252, but some bad decoders then just copy the original byte; the special codec reverses that process).

    In fact, in your case this was done twice, not a feat I had seen before:

    >>> print s.encode('sloppy-cp1252').decode('utf8').encode('sloppy-cp1252').decode('utf8')
    ラブライブ!スクールアイドルフェスティバル(スクフェス)