This is my first experience with Unicode, and also with escaping, and I'm in over my head. The source is a website's pull-down menu, and I want to generate a text list of all the items using Python.
From &#x65B0;&#x5317;&#x5E02;
I understand that I need to make something that looks like u'\u65B0\u5317\u5E02'
in order to see 新北市 when I print it.
However ''.join([s.replace('&#x', '\u') for s in '&#x65B0;&#x5317;&#x5E02;'.split(';')])
fails:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
and ''.join([s.replace('&#x', '\\u') for s in '&#x65B0;&#x5317;&#x5E02;'.split(';')])
(double backslash) gives me '\\u65B0\\u5317\\u5E02'
Question: What expression for mystring will make print(mystring) show '新北市'?
Since what you're dealing with is really a string of HTML entities (numeric character references), you can simply decode the input with html.unescape:
import html
print(html.unescape('&#x65B0;&#x5317;&#x5E02;'))
This outputs:
新北市
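As for why the replace approach can't work: a \uXXXX escape is only interpreted when Python compiles a string literal in source code, so building the characters '\\u65B0' at runtime just gives you a backslash followed by text. If you want to see roughly what the decoding involves for hex references like these, here is a minimal sketch that converts each code point with chr(int(..., 16)). The helper name decode_hex_entities is my own, not part of any library, and it only handles the &#xHHHH; form; html.unescape also covers decimal and named entities, so prefer it in real code.

```python
import html
import re

raw = '&#x65B0;&#x5317;&#x5E02;'

def decode_hex_entities(text):
    # Hypothetical helper: replace each &#xHHHH; reference with the
    # character at that hexadecimal code point. Anything else is left as-is.
    return re.sub(
        r'&#x([0-9A-Fa-f]+);',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

mystring = decode_hex_entities(raw)
print(mystring)                        # 新北市
print(mystring == html.unescape(raw))  # True
```

For a whole menu, you would apply html.unescape (or the helper) to each item's text before writing it to your list.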