python python-3.x string unicode-escapes

Normal string contains "xe2x80x93" instead of the "-" char

I have a problem with strings in Python 3. My variable g is a normal string, but it contains an annoying "xe2x80x93" because it comes from a web parser. I would like to convert this to the matching character "-".

content = str(urllib.request.urlopen(site, timeout=10).read())
g = content.split('<h1 itemprop="name"')[1].split('</span></h1>')[0].split('<span>')[1].replace("\\", "")

print(type(g)) --> string
print(g)  --> "Flash xe2x80x93 der rote Blitz"

print(g.encode('latin-1').decode('utf-8')) --> AttributeError: 'str' object has no attribute 'decode'
print(repr(g.decode('unicode-escape'))) --> AttributeError: 'str' object has no attribute 'decode'
print(g.encode('ascii','replace')) --> b'Flash xe2x80x93 der rote Blitz'
print(bytes(g, "utf-8").decode()) --> "Flash xe2x80x93 der rote Blitz"
print(bytes(g, "utf-8").decode("unicode_escape")) --> "Flash â der rote Blitz"

How can I make this work? I'm not getting any further.

Solution

You have the right idea with decode.

By wrapping the output in str(...) in this line:

content = str(urllib.request.urlopen(site, timeout=10).read())

you're either converting a bytes object to its string representation (which will be evident from a leading b' and trailing ' in content), or, if it has already been decoded as ISO-8859-1, doing nothing.

In either case, don't do that -- remove the wrapping str call.
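To see why the str(...) wrapper causes the problem, here is a minimal sketch. It assumes read() returned UTF-8 bytes, and uses a hard-coded sample value (raw) in place of the real page so that it runs without network access:

```python
# Stand-in for the bytes that urlopen(...).read() would return (assumed sample data).
# "\u2013" is the en dash that should appear in the title.
raw = "Flash \u2013 der rote Blitz".encode("utf-8")

# str() on a bytes object does not decode it; it returns the repr,
# with every non-ASCII byte spelled out as a backslash escape.
as_repr = str(raw)
print(as_repr)   # b'Flash \xe2\x80\x93 der rote Blitz'

# Removing the backslashes afterwards leaves the literal text "xe2x80x93".
stripped = as_repr.replace("\\", "")
print(stripped)  # b'Flash xe2x80x93 der rote Blitz'

# Decoding the bytes directly yields the intended text.
decoded = raw.decode("utf-8")
print(decoded)
```

Once the backslashes have been stripped, the original bytes can no longer be recovered reliably, which is why the fix belongs at the point where the response is read.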

Now, content will be either a bytes object or a str object.

If it's a string, it has already been decoded (incorrectly) as ISO-8859-1. You'll want to encode it back to a bytes object, then decode it correctly:

content = urllib.request.urlopen(site, timeout=10).read()

if isinstance(content, str):
    content = content.encode('iso-8859-1')
content = content.decode('utf8')
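The encode-then-decode round trip can be checked in isolation. This is only a sketch: fix_mojibake is a hypothetical helper name, and the mis-decoded input is simulated rather than fetched from a real site:

```python
def fix_mojibake(text: str) -> str:
    """Undo a wrong ISO-8859-1 decode of data that was really UTF-8.

    Hypothetical helper for illustration; it raises UnicodeError if the
    text was not produced by that specific mistake.
    """
    return text.encode("iso-8859-1").decode("utf-8")


original = "Flash \u2013 der rote Blitz"  # \u2013 is the en dash

# Simulate the mistake: UTF-8 bytes decoded as ISO-8859-1.
# The en dash becomes "â" followed by two invisible control characters.
wrong = original.encode("utf-8").decode("iso-8859-1")

fixed = fix_mojibake(wrong)
print(fixed == original)
```

This works because ISO-8859-1 maps every byte value to exactly one character, so encoding back with the same codec restores the original bytes unchanged.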

Now, your \xe2\x80\x93 bytes should properly show up as: –

Update:

From your comment, all you need to do is:

content = urllib.request.urlopen(site, timeout=10).read().decode('utf8')
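Put together with the parsing line from the question, the flow might look like the sketch below. The HTML snippet in raw is an assumed sample standing in for the real response body, and the replace("\\", "") call is dropped because a correctly decoded string contains no stray backslashes:

```python
# Assumed sample body; in the question this would be
# urllib.request.urlopen(site, timeout=10).read()
raw = '<h1 itemprop="name"><span>Flash \u2013 der rote Blitz</span></h1>'.encode("utf-8")

# Decode once, right after reading, instead of wrapping in str(...).
content = raw.decode("utf8")

# Same split chain as in the question, minus the backslash removal.
g = content.split('<h1 itemprop="name"')[1].split('</span></h1>')[0].split('<span>')[1]

print(type(g))
print(g)
```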