Search code examples
pythonpycurlstringio

Python(2.6) cStringIO unicode support?


I'm using python pycurl module to download content from various web pages. Since I also wanted to support potential unicode text I've been avoiding the cStringIO.StringIO function which according to python docs: cStringIO - Faster version of StringIO

Unlike the StringIO module, this module is not able to accept Unicode strings that cannot be encoded as plain ASCII strings.

... does not support unicode strings. Actually it states that it does not support unicode strings that can not be converted to ASCII strings. Can someone please clarify this to me? Which can and which can not be converted?

I've tested with the following code and it seems to work just fine with unicode:

import pycurl
import cStringIO

downloadedContent = cStringIO.StringIO()
curlHandle = pycurl.Curl()
curlHandle.setopt(pycurl.WRITEFUNCTION, downloadedContent.write)
curlHandle.setopt(pycurl.URL, 'http://www.ltg.ed.ac.uk/~richard/unicode-sample.html')

curlHandle.perform()
content = downloadedContent.getvalue()

fileHandle = open('unicode-test.txt','w')
for char in content:
    fileHandle.write(char)

And the file is correctly written. I can even print the whole content in the console, all characters show up fine... So what I'm puzzled about is, where does the cStringIO fail ? Is there any reason why I should not use it?

[Note: I'm using Python 2.6 and need to stick to this version]


Solution

  • Any text that only uses ASCII codepoints (byte values 00-7F hexadecimal) can be converted to ASCII. Basically any text that uses characters not often used in American English is not ASCII.

    In your example code, you are not converting the input to Unicode text; you are treating it as un-decoded bytes. The test page in question is encoded in UTF-8, and you never decode that to Unicode.

    If you were to decode the value to a Unicode string, you won't be able to store that string in a cStringIO object.

    You may want to read up on the difference between Unicode and text encodings such as ASCII and UTF-8. I can recommend: