I have a problem with strings in Python 3. My variable g is a normal string, but it contains an annoying "xe2x80x93" because it comes from a web parser. I would like to convert this to the proper character, "–".
content = str(urllib.request.urlopen(site, timeout=10).read())
g = content.split('<h1 itemprop="name"')[1].split('</span></h1>')[0].split('<span>')[1].replace("\\", "")
print(type(g)) --> string
print(g) --> "Flash xe2x80x93 der rote Blitz"
print(g.encode('latin-1').decode('utf-8')) --> AttributeError: 'str' object has no attribute 'decode'
print(repr(g.decode('unicode-escape'))) --> AttributeError: 'str' object has no attribute 'decode'
print(g.encode('ascii','replace')) --> b'Flash xe2x80x93 der rote Blitz'
print(bytes(g, "utf-8").decode()) --> "Flash xe2x80x93 der rote Blitz"
print(bytes(g, "utf-8").decode("unicode_escape")) --> "Flash â der rote Blitz"
How can I get this to work? I'm not getting any further.
You have the right idea with decode.
By wrapping the output in str(...) in this line:
content = str(urllib.request.urlopen(site, timeout=10).read())
You're either converting a bytes object to a string (which will be evident by a leading b' and trailing ' in the content), or, if it's already been decoded as ISO-8859-1, doing nothing.
In either case, don't do that; remove the wrapping str call.
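To see why the str(...) wrapper causes the symptom in the question, here is a minimal sketch. It uses a hard-coded sample (an assumption standing in for what urlopen(...).read() returns) rather than fetching a real page, and the names raw, wrapped, and decoded are only illustrative:

```python
# Stand-in for the raw bytes that urlopen(...).read() would return (sample data, not fetched).
raw = "Flash \u2013 der rote Blitz".encode("utf-8")  # \u2013 is the en dash

wrapped = str(raw)             # the repr of the bytes object, not decoded text
decoded = raw.decode("utf-8")  # the actual text

print(wrapped)                    # b'Flash \xe2\x80\x93 der rote Blitz'
print(wrapped.replace("\\", ""))  # b'Flash xe2x80x93 der rote Blitz'
print(decoded)                    # Flash – der rote Blitz
```

The second print reproduces the "xe2x80x93" from the question: the escape sequences in the repr are literal characters, and stripping the backslashes leaves only their hex digits behind.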
Now, content will be either a bytes object or a str object.
If it's a string, it has already been decoded (incorrectly) as ISO-8859-1. You'll want to encode it back to a bytes object, then decode it correctly:
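import urllib.request

content = urllib.request.urlopen(site, timeout=10).read()
if isinstance(content, str):
    # Already decoded (incorrectly) as ISO-8859-1: turn it back into the raw bytes.
    content = content.encode('iso-8859-1')
content = content.decode('utf8')  # decode the raw bytes as UTF-8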
Now, your \xe2\x80\x93 bytes should properly show up as: –
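The encode-then-decode repair can be checked without a network call. This is a small sketch on sample data (the variable names are illustrative, not from the original code); it works because ISO-8859-1 maps every byte value to exactly one character, so encoding back recovers the original bytes:

```python
original = "Flash \u2013 der rote Blitz"  # \u2013 is the en dash
raw = original.encode("utf-8")            # the bytes a UTF-8 page would send

# Decoding UTF-8 bytes as ISO-8859-1 turns the 3-byte dash into 3 separate characters.
mojibake = raw.decode("iso-8859-1")

# Encoding back to ISO-8859-1 restores the raw bytes, which then decode correctly.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

print(repaired == original)  # True
```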
Update:
From your comment, all you need to do is:
content = urllib.request.urlopen(site, timeout=10).read().decode('utf8')
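If you would rather not hard-code the encoding, one possible variation is to use the charset the response declares and fall back to UTF-8 when none is given. This is only a sketch: fetch_text and its fallback parameter are hypothetical names, and it assumes the server's Content-Type header is accurate.

```python
import urllib.request


def fetch_text(url, timeout=10, fallback="utf-8"):
    """Fetch url and decode the body with the charset the response declares.

    fetch_text and fallback are hypothetical names, not part of urllib.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() returns None when the header declares no charset.
        charset = response.headers.get_content_charset() or fallback
        return response.read().decode(charset)
```

With this helper, the line above would become content = fetch_text(site).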