python parsing unicode html-parsing ascii

Python Unencode unicode html hexadecimal


Suppose I have strings with lots of stuff like

&#xE2;&#x80;&#x9C;words words words

Is there a way to convert these through python directly into the characters they represent?

I tried

h = HTMLParser.HTMLParser()
print h.unescape(x)

but got this error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

I also tried

print h.unescape(x).encode('utf-8')

but it encodes

&#xE2;&#x80;&#x9C; as â

when it should be a quote


Solution

  • &#xE2;&#x80;&#x9C; is the UTF-8 byte sequence for the U+201C LEFT DOUBLE QUOTATION MARK character, written out as three separate character references, one per byte. Something is majorly mucked up there. The correct encoding would have been &#x201C;.

    You can use the HTML parser to unescape this, but you'll need to repair the resulting Mojibake:

    >>> import HTMLParser
    >>> h = HTMLParser.HTMLParser()
    >>> x = '&#xE2;&#x80;&#x9C;'
    >>> h.unescape(x)
    u'\xe2\x80\x9c'
    >>> h.unescape(x).encode('latin1')
    '\xe2\x80\x9c'
    >>> h.unescape(x).encode('latin1').decode('utf8')
    u'\u201c'
    >>> print h.unescape(x).encode('latin1').decode('utf8')
    “
    

    If printing still gives you a UnicodeEncodeError, then your terminal or console is incorrectly configured and Python is inadvertently encoding to ASCII.
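
The session above is Python 2. For Python 3, where `HTMLParser.unescape` is no longer available and `html.unescape` follows HTML5 rules (which may remap references in the 0x80-0x9F range, so the `latin1` round trip may not carry over unchanged), a rough sketch of the same repair could look like the following. The function name `unescape_utf8_refs` and both regex names are hypothetical, not part of any library. It assumes that any run of one-byte hex references that happens to form valid UTF-8 was meant as UTF-8, which is a heuristic and can misfire on text that really intended those Latin-1 characters.

```python
import html
import re

# A run of hex character references whose values fit in one byte,
# e.g. "&#xE2;&#x80;&#x9C;". Hypothetical helper patterns.
_BYTE_REF_RUN = re.compile(r'(?:&#[xX][0-9A-Fa-f]{1,2};)+')
_BYTE_REF = re.compile(r'&#[xX]([0-9A-Fa-f]{1,2});')


def _decode_run(run):
    """Treat a run of byte-valued references as UTF-8 if it decodes cleanly."""
    raw = bytes(int(digits, 16) for digits in _BYTE_REF.findall(run))
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to ordinary HTML unescaping.
        return html.unescape(run)


def unescape_utf8_refs(text):
    """Unescape HTML, repairing UTF-8 bytes written as separate references."""
    pieces = []
    pos = 0
    for match in _BYTE_REF_RUN.finditer(text):
        # Text between runs gets normal unescaping.
        pieces.append(html.unescape(text[pos:match.start()]))
        pieces.append(_decode_run(match.group(0)))
        pos = match.end()
    pieces.append(html.unescape(text[pos:]))
    return ''.join(pieces)


if __name__ == '__main__':
    print(unescape_utf8_refs('&#xE2;&#x80;&#x9C;words words words'))
    # prints: “words words words
```

Working on the bytes directly sidesteps the intermediate Mojibake string, so no `encode('latin1')` step is needed. Correctly written references such as `&#x201C;` and named entities such as `&amp;` pass through `html.unescape` as usual.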