Search code examples
pythonpython-3.xunicode

Convert unicode series from a webpage to Chinese characters with Python


This is my first experience with unicode, and also with escaping and I'm over my head. The source is a website's pull-down menu and I want to generate a text list of all the items using Python.

From 新北&#x5E02 I understand that I need to make something that looks like u'\u65B0\u5317\u5E02' in order to see 新北市 when I print it.

However ''.join([s.replace('&#x', '\u') for s in ''.split(';')]) fails:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

and ''.join([s.replace('&#x', '\\u') for s in '新北&#x5E02'.split(';')]) (double backslash) gives me '\\u65B0\\u5317\\u5E02'

Quesiton: What expression for mystring will make `print(mystring)' show '新北市'


Solution

  • Since what you're dealing with are really HTML entities, you can simply parse the input with html.unescape:

    import html
    print(html.unescape('新北&#x5E02'))
    

    This outputs:

    新北市