Tags: python, python-2.7, encoding, ascii, non-ascii-characters

Python encoding/decoding problems


How do I decode strings such as "weren\xe2\x80\x99t" back to normal text?

The word should actually read weren't, not "weren\xe2\x80\x99t". For example:

print "\xe2\x80\x9cThings"
string = "\xe2\x80\x9cThings"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

“Things
“Things
Things

But I actually want to get "Things.

or:

print "weren\xe2\x80\x99t"
string = "weren\xe2\x80\x99t"
print string.decode('utf-8')
print string.encode('ascii', 'ignore')

weren’t
weren’t
werent

But I actually want to get weren't.

How should I do this?


Solution

  • I mapped the most common problem characters, so this is a fairly complete answer, based on Oliver W.'s answer.

    This function is by no means ideal, but it is a good place to start. More character definitions are listed here:

    http://utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=string
    http://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&names=-&utf8=string-literal

    ...

    def unicodetoascii(text):
        """Replace common typographic characters in a UTF-8 byte string with ASCII (Python 2)."""
        # Keys and values are code points, as required by unicode.translate().
        uni2ascii = {
                # single quotation marks (U+2018, U+2019, U+201B)
                ord('\xe2\x80\x98'.decode('utf-8')): ord("'"),
                ord('\xe2\x80\x99'.decode('utf-8')): ord("'"),
                ord('\xe2\x80\x9b'.decode('utf-8')): ord("'"),

                # double quotation marks (U+201C to U+201F)
                ord('\xe2\x80\x9c'.decode('utf-8')): ord('"'),
                ord('\xe2\x80\x9d'.decode('utf-8')): ord('"'),
                ord('\xe2\x80\x9e'.decode('utf-8')): ord('"'),
                ord('\xe2\x80\x9f'.decode('utf-8')): ord('"'),

                # hyphens and dashes (U+2010 to U+2014)
                ord('\xe2\x80\x90'.decode('utf-8')): ord('-'),
                ord('\xe2\x80\x91'.decode('utf-8')): ord('-'),
                ord('\xe2\x80\x92'.decode('utf-8')): ord('-'),
                ord('\xe2\x80\x93'.decode('utf-8')): ord('-'),
                ord('\xe2\x80\x94'.decode('utf-8')): ord('-'),

                # e with acute accent (U+00E9)
                ord('\xc3\xa9'.decode('utf-8')): ord('e'),

                # primes and reversed primes (U+2032 to U+2037)
                ord('\xe2\x80\xb2'.decode('utf-8')): ord("'"),
                ord('\xe2\x80\xb3'.decode('utf-8')): ord("'"),
                ord('\xe2\x80\xb4'.decode('utf-8')): ord("'"),
                ord('\xe2\x80\xb5'.decode('utf-8')): ord("'"),
                ord('\xe2\x80\xb6'.decode('utf-8')): ord("'"),
                ord('\xe2\x80\xb7'.decode('utf-8')): ord("'"),

                # superscript operators and parentheses (U+207A to U+207E)
                ord('\xe2\x81\xba'.decode('utf-8')): ord("+"),
                ord('\xe2\x81\xbb'.decode('utf-8')): ord("-"),
                ord('\xe2\x81\xbc'.decode('utf-8')): ord("="),
                ord('\xe2\x81\xbd'.decode('utf-8')): ord("("),
                ord('\xe2\x81\xbe'.decode('utf-8')): ord(")"),
        }
        # Raises UnicodeEncodeError if a non-ASCII character is not in the table.
        return text.decode('utf-8').translate(uni2ascii).encode('ascii')

    print unicodetoascii("weren\xe2\x80\x99t")
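
Since the table above can never list every character, one possible variation is to keep a short explicit table for punctuation and let `unicodedata.normalize` handle accented letters and superscripts. The sketch below is only an illustration, not part of the original answer: the names `PUNCT_TO_ASCII` and `to_ascii` are made up here, the table is deliberately incomplete, and the code is written so that it should run on both Python 2.7 and Python 3. It assumes the input is a UTF-8 byte string.

```python
import unicodedata

# Hypothetical, deliberately short table: code point -> ASCII replacement.
# Extend it with whatever punctuation actually appears in your data.
PUNCT_TO_ASCII = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
    0x2013: u"-",   # en dash
    0x2014: u"-",   # em dash
}


def to_ascii(raw_bytes):
    """Decode UTF-8 bytes and return a best-effort ASCII text version."""
    text = raw_bytes.decode("utf-8")
    # Step 1: explicit replacements for punctuation that has no decomposition.
    text = text.translate(PUNCT_TO_ASCII)
    # Step 2: NFKD splits e.g. an accented letter into base letter + combining mark.
    text = unicodedata.normalize("NFKD", text)
    # Step 3: drop whatever is still non-ASCII (combining marks, unmapped characters).
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii(b"weren\xe2\x80\x99t"))
print(to_ascii(b"\xe2\x80\x9cThings"))
```

The trade-off is that anything neither in the table nor decomposable by NFKD is silently dropped instead of raising `UnicodeEncodeError`, so this is only appropriate when losing such characters is acceptable.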