
Encoding and decoding strings in Python


I have a list of HTML pages that may contain certain encoded characters. Some examples are below:

<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>

I would like to decode (or unescape; I'm unsure of the correct terminology) these strings to:

<a href="mailto:lad at maestro dot com">
<em>ada@graphics.maestro.com</em>
<em>mel@graphics.maestro.com</em>

Note that the HTML pages are held as strings. Also, I DO NOT want to use any external library such as BeautifulSoup or lxml; only native Python libraries are OK.

Edit -

The solution below isn't perfect. Unescaping with HTMLParser after unquoting with urllib2 throws the following error in some cases:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 31: ordinal not in range(128)


Solution

  • You need to do two things: URL-unquote the percent-escapes and unescape the HTML entities.
    The Python 2 standard library has urllib2 and HTMLParser to help with those tasks.

    # Python 2 only: both modules were renamed in Python 3.
    import HTMLParser, urllib2

    markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
    <em>ada&#x40;graphics.maestro.com</em>
    <em>mel&#x40;graphics.maestro.com</em>'''

    # URL-unquote the percent-escapes first, then unescape the HTML entities.
    result = HTMLParser.HTMLParser().unescape(urllib2.unquote(markup))
    for line in result.split("\n"):
        print(line)
    

    Result:

    <a href="mailto:lad at maestro dot com">
    <em>ada@graphics.maestro.com</em>
    <em>mel@graphics.maestro.com</em>
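
On Python 3, the HTMLParser and urllib2 modules no longer exist under those names. Here is a minimal sketch of the same two steps, assuming Python 3.4 or later, where html.unescape and urllib.parse.unquote are the standard-library equivalents. The helper name decode_markup is hypothetical and not part of the original answer:

```python
# Python 3 sketch of the same two-step decode (assumes Python 3.4+).
from html import unescape
from urllib.parse import unquote

markup = '''<a href="mailto:lad%20at%20maestro%20dot%20com">
<em>ada&#x40;graphics.maestro.com</em>
<em>mel&#x40;graphics.maestro.com</em>'''


def decode_markup(text):
    """Hypothetical helper: URL-unquote first, then unescape HTML entities."""
    return unescape(unquote(text))


result = decode_markup(markup)
print(result)
```

This should print the same three decoded lines as the Python 2 version. Since Python 3 strings are already Unicode, the ASCII decoding error from the question should not come up at this step.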
    

    Edit:
    If your pages can contain non-ASCII characters, you'll need to take care to decode on input and encode on output.
    The sample file you uploaded has charset set to cp-1252, so let's try decoding from that to Unicode:

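    import codecs
    import HTMLParser, urllib2

    # `filename` is the path to the input page.
    # Decode the cp1252 bytes to unicode on input...
    with codecs.open(filename, encoding="cp1252") as fin:
        decoded = fin.read()
    result = HTMLParser.HTMLParser().unescape(urllib2.unquote(decoded))
    # ...and encode back to cp1252 on output.
    with codecs.open('/output/file.html', 'w', encoding='cp1252') as fou:
        fou.write(result)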
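
If the page really is cp1252, the byte 0x94 from the reported error is a closing curly quote, which the ASCII codec cannot decode. Here is a small sketch of the decode-on-input, encode-on-output idea, assuming Python 3 and a made-up byte string standing in for the uploaded file:

```python
from html import unescape
from urllib.parse import unquote

# Invented sample: cp1252 bytes with curly quotes (0x93 opens, 0x94 closes).
raw = b'<em>\x93ada&#x40;graphics.maestro.com\x94</em>'

# Decoding as ASCII fails on the first byte above 0x7f.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decode with the page's real charset, process, then encode for output.
text = raw.decode('cp1252')
result = unescape(unquote(text))
encoded = result.encode('cp1252')
```

With real files on Python 3, the built-in open(filename, encoding='cp1252') should do the same job as codecs.open.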
    

    Edit2:
    If you don't care about the non-ASCII characters, you can simplify a bit by dropping them on input:

    with open(filename) as fin:
        decoded = fin.read().decode('ascii','ignore')
    ...
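
To see what the 'ignore' error handler does, here is a rough sketch, assuming Python 3 and an invented byte string. Every byte outside ASCII is silently dropped, so curly quotes and accented letters disappear from the result:

```python
from html import unescape
from urllib.parse import unquote

# Invented sample: 0xe9 and 0x94 are non-ASCII bytes
# (e-acute and a closing curly quote in cp1252).
raw = b'caf\xe9 \x94lad%20at%20maestro\x94 &#x40;'

# 'ignore' discards any byte the ASCII codec cannot decode.
decoded = raw.decode('ascii', 'ignore')
result = unescape(unquote(decoded))
print(result)
```

This is lossy by design, so it only makes sense when the dropped characters really don't matter.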