Tags: python, unicode, apache2, cgi, unicode-string

Python urllib.request and utf8 decoding question


I'm writing a simple Python CGI script that grabs a webpage and displays the HTML file in the web browser (acting like a proxy). Here is the script:

#!/usr/bin/env python3.0

import urllib.request

site = "http://reddit.com/"
site = urllib.request.urlopen(site)
site = site.read()
site = site.decode('utf8')

print("Content-type: text/html\n\n")
print(site)

This script works fine when run from the command line, but when I request it through a web browser, I get a blank page. Here is the error I get in Apache's error_log:

Traceback (most recent call last):
  File "/home/public/projects/proxy/script.cgi", line 11, in <module>
    print(site)
  File "/usr/local/lib/python3.0/io.py", line 1491, in write
    b = encoder.encode(s)
  File "/usr/local/lib/python3.0/encodings/ascii.py", line 22, in encode
    return codecs.ascii_encode(input, self.errors)[0]
UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 33777: ordinal not in range(128)

Solution

  • When you print it at the command line, you print a Unicode string to the terminal. The terminal has an encoding, so Python will encode your Unicode string to that encoding. This will work fine.

    When you run it as a CGI script, stdout is not a terminal, and Apache typically starts the script without the locale settings that would tell Python which encoding to use. Python therefore falls back to ASCII. That fails because ASCII cannot represent every character on the page (here '\u2019', a right single quotation mark), so you get the error above.

    The fix is to encode the string yourself in a specific encoding (why not UTF-8?) and to declare that encoding in the Content-Type header.

    So something like this:

    import sys
    sys.stdout.buffer.write(b"Content-type: text/html; charset=UTF-8\n\n")  # the parameter is "charset"
    sys.stdout.buffer.write(site.encode('UTF8'))  # write bytes, bypassing stdout's text encoding
    

    Under Python 2, this would work as well:

    print("Content-type: text/html; charset=UTF-8\n\n")  # the parameter is "charset"
    print(site.encode('UTF8'))
    

    But under Python 3 the encoded data is a bytes object, so print() would output its repr (b'...') rather than the raw bytes.

    Of course you'll notice that you now first decode from UTF8 and then re-encode it. You don't need to do that, strictly speaking. But if you want to modify the HTML in between, it may actually be a good idea to do so, and keep all modifications in Unicode.
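Putting the pieces of this answer together, the whole script might look roughly like the sketch below. The function names (`build_response`, `fetch_html`, `main`) are made up for illustration and are not from the original question. It also assumes that falling back to UTF-8 is acceptable when the fetched page declares no charset.

```python
#!/usr/bin/env python3
import sys
import urllib.request


def build_response(html: str) -> bytes:
    """Return the full CGI response (header plus body) as UTF-8 bytes."""
    header = b"Content-Type: text/html; charset=UTF-8\n\n"
    return header + html.encode("utf-8")


def fetch_html(url: str) -> str:
    """Download a page and decode it to a Unicode string."""
    with urllib.request.urlopen(url) as response:
        # Assumption: use the charset the server declares, else UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def main() -> None:
    html = fetch_html("http://reddit.com/")
    # Write bytes to the underlying binary stream so Python never
    # tries to encode the page with stdout's (ASCII) text encoding.
    sys.stdout.buffer.write(build_response(html))
    sys.stdout.buffer.flush()
```

In the actual CGI script, the last line would be a call to `main()`. Keeping the encoding step in its own function means any changes to the HTML can be made on the Unicode string before `build_response` turns it into bytes.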