Tags: python, python-3.x, string, unicode-escapes

Normal string contains "xe2x80x93" instead of the "–" character


I have a problem with strings in Python 3. My variable g is a normal string, but it contains an annoying "xe2x80x93" because it comes from a web parser. I would like to convert this to the proper character "–".

content = str(urllib.request.urlopen(site, timeout=10).read())
g = content.split('<h1 itemprop="name"')[1].split('</span></h1>')[0].split('<span>')[1].replace("\\", "")

print(type(g))  # --> string
print(g)  # --> "Flash xe2x80x93 der rote Blitz"

print(g.encode('latin-1').decode('utf-8'))  # --> AttributeError: 'str' object has no attribute 'decode'
print(repr(g.decode('unicode-escape')))  # --> AttributeError: 'str' object has no attribute 'decode'
print(g.encode('ascii','replace'))  # --> b'Flash xe2x80x93 der rote Blitz'
print(bytes(g, "utf-8").decode())  # --> "Flash xe2x80x93 der rote Blitz"
print(bytes(g, "utf-8").decode("unicode_escape"))  # --> "Flash â der rote Blitz"

How can I make this work? I can't get any further.


Solution

  • You have the right idea with decode.

    By wrapping the output in str(...) in this line:

    content = str(urllib.request.urlopen(site, timeout=10).read())
    

    You're either converting a bytes object to its string representation (which shows up as a leading b' and a trailing ' in the content), or, if it's already been decoded as ISO-8859-1, doing nothing.

    In either case, don't do that -- remove the wrapping str call.

    Now, content will be either a bytes object or a str object.

    If it's a string, it has already been decoded (incorrectly) as ISO-8859-1. You'll want to encode it back to a bytes object, then decode it correctly:

    content = urllib.request.urlopen(site, timeout=10).read()
    
    if isinstance(content, str):
        content = content.encode('iso-8859-1')
    content = content.decode('utf8')
    

    Now, your \xe2\x80\x93 bytes should properly show up as: –
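The encode-then-decode repair above can be tried without a network request. The following is only a minimal sketch: the sample title is taken from the question, and the variable names `raw`, `wrong`, and `repaired` are made up for illustration.

```python
# Stand-in for the page bytes: the en dash (U+2013) is E2 80 93 in UTF-8.
raw = "Flash \u2013 der rote Blitz".encode("utf-8")

# Decoding with the wrong codec does not fail; it yields three junk
# characters in place of the single en dash.
wrong = raw.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one code point, so encoding
# back recovers the original bytes, which can then be decoded as UTF-8.
repaired = wrong.encode("iso-8859-1").decode("utf-8")

print(repr(wrong))
print(repaired)
```

The round trip is lossless only because ISO-8859-1 covers all 256 byte values; with other codecs the wrongly decoded text may not be recoverable this way.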

    Update:

    From your comment, all you need to do is:

    content = urllib.request.urlopen(site, timeout=10).read().decode('utf8')
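To see why the str(...) wrapper produces the "xe2x80x93" text from the question, and why decoding avoids it, here is a small offline sketch. The bytes literal stands in for what read() returns; `raw`, `as_repr`, `stripped`, and `decoded` are illustrative names, not part of the original code.

```python
# Stand-in for urlopen(...).read(): UTF-8 encoded bytes of the title.
raw = "Flash \u2013 der rote Blitz".encode("utf-8")

# str() on a bytes object returns its repr, including the b'...' wrapper
# and literal backslash escapes for the non-ASCII bytes.
as_repr = str(raw)

# Removing the backslashes (as the question's code does) leaves the
# escape digits behind as ordinary text.
stripped = as_repr.replace("\\", "")

# Decoding the bytes turns the three-byte sequence into one character.
decoded = raw.decode("utf-8")

print(as_repr)
print(stripped)
print(decoded)
```

Once the text has gone through str() and had its backslashes stripped, the original bytes are no longer directly available, which is why the fix belongs at the point where the response is read.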