
How do I post unicode characters using httplib?


I am trying to POST unicode data with httplib's request method:

import httplib

s = u"עברית"
data = """
<spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
<text>%s</text>
</spellrequest>
""" % s

con = httplib.HTTPSConnection("www.google.com")
con.request("POST", "/tbproxy/spell?lang=he", data)
response = con.getresponse().read()

However, I get this error:

Traceback (most recent call last):
  File "C:\Scripts\iQuality\test.py", line 47, in <module>
    print spellFix(u"╫á╫נ╫¿╫ץ╫ר╫ץ")
  File "C:\Scripts\iQuality\test.py", line 26, in spellFix
    con.request("POST", "/tbproxy/spell?lang=%s" % lang, data)
  File "C:\Python27\lib\httplib.py", line 955, in request
    self._send_request(method, url, body, headers)
  File "C:\Python27\lib\httplib.py", line 989, in _send_request
    self.endheaders(body)
  File "C:\Python27\lib\httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 815, in _send_output
    self.send(message_body)
  File "C:\Python27\lib\httplib.py", line 787, in send
    self.sock.sendall(data)
  File "C:\Python27\lib\ssl.py", line 220, in sendall
    v = self.send(data[count:])
  File "C:\Python27\lib\ssl.py", line 189, in send
    v = self._sslobj.write(data)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 97-102: ordinal not in range(128)

What am I doing wrong?


Solution

  • HTTP is not defined in terms of any particular character encoding; it works in octets. You need to encode your data, and then tell the server which encoding you used. Let's use UTF-8, since it's usually the best choice.

    This data looks like XML, but it is missing the XML declaration. Some services may accept that, but you shouldn't rely on it. In fact, the encoding belongs in that declaration, so make sure you include it. The declaration looks like <?xml version="1.0" encoding="encoding"?>.

    s = u"עברית"
    data_unicode = u"""<?xml version="1.0" encoding="UTF-8"?>
    <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
    <text>%s</text>
    </spellrequest>
    """ % s
    
    data_octets = data_unicode.encode('utf-8')
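
To see why the original call fails and the encoded one does not, here is a minimal sketch of the difference between characters and octets. It only assumes the Hebrew word from the question (written with escapes so the file encoding does not matter); the names text, ascii_error and octets are just for illustration. It should run unchanged on Python 2.7 and Python 3.

```python
# The word from the question, spelled with escapes: 5 characters.
text = u"\u05e2\u05d1\u05e8\u05d9\u05ea"

# Python 2 falls back to the ASCII codec when a unicode body reaches the
# socket layer; doing the same conversion by hand shows the same failure.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding explicitly yields a byte string that can be sent as-is.
octets = text.encode("utf-8")

print(ascii_error is not None)         # the ASCII conversion failed
print(len(text))                       # number of characters
print(len(octets))                     # number of octets after encoding
print(octets.decode("utf-8") == text)  # the round trip is lossless
```

Each Hebrew letter takes two octets in UTF-8, which is why the two lengths differ.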
    

    As a matter of courtesy, you should also tell the server itself the format and encoding, with the content-type header:

    con = httplib.HTTPSConnection("www.google.com")
    con.request("POST",
                "/tbproxy/spell?lang=he", 
                data_octets, {'content-type': 'text/xml; charset=utf-8'})
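
One way to keep the body encoding and the header from drifting apart is to build both in one place. The sketch below is only an illustration: build_spell_request, TEMPLATE and CHARSET are hypothetical names, not part of httplib, and it uses xml.sax.saxutils.escape so that characters such as & in the word do not break the XML. It does not send anything over the network.

```python
from xml.sax.saxutils import escape

# Must match the encoding named in the XML declaration below.
CHARSET = "utf-8"

TEMPLATE = (
    u'<?xml version="1.0" encoding="UTF-8"?>\n'
    u'<spellrequest textalreadyclipped="0" ignoredups="1" '
    u'ignoredigits="1" ignoreallcaps="0">\n'
    u'<text>%s</text>\n'
    u'</spellrequest>\n'
)


def build_spell_request(word):
    """Return (body, headers) for the POST (hypothetical helper)."""
    # Escape XML special characters first, then encode to octets.
    body = (TEMPLATE % escape(word)).encode(CHARSET)
    headers = {"Content-Type": "text/xml; charset=%s" % CHARSET}
    return body, headers


body, headers = build_spell_request(u"\u05e2\u05d1\u05e8\u05d9\u05ea & co")
print(headers["Content-Type"])
print(len(body))
```

The resulting pair can be passed as the third and fourth arguments of con.request in the call above.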
    

    EDIT: It works fine on my machine. Are you sure you're not skipping something? Here is a full example:

    >>> from cgi import escape
    >>> from urllib import urlencode
    >>> import httplib
    >>> 
    >>> template = u"""<?xml version="1.0" encoding="UTF-8"?>
    ... <spellrequest textalreadyclipped="0" ignoredups="1" ignoredigits="1" ignoreallcaps="0">
    ... <text>%s</text>
    ... </spellrequest>
    ... """
    >>> 
    >>> def chkspell(word, lang='en'):
    ...     data_octets = (template % escape(word)).encode('utf-8')
    ...     con = httplib.HTTPSConnection("www.google.com")
    ...     con.request("POST",
    ...         "/tbproxy/spell?" + urlencode({'lang': lang}),
    ...         data_octets,
    ...         {'content-type': 'text/xml; charset=utf-8'})
    ...     req = con.getresponse()
    ...     return req.read()
    ... 
    >>> chkspell('baseball')
    '<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="8"></spellresult>'
    >>> chkspell(corpus, 'he')
    '<?xml version="1.0" encoding="UTF-8"?><spellresult error="0" clipped="0" charschecked="5"></spellresult>'
    

    I did notice that when I pasted your example (the string bound to corpus in the call above), it appears in the opposite order on my terminal from how it shows in my browser. That's not too surprising, considering Hebrew is a right-to-left language.

    >>> corpus = u"עברית"
    >>> print corpus[0]
    ע
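
The display order is purely a rendering matter: the string is stored in logical (reading) order, and a bidi-aware renderer draws it right to left. As a rough check, assuming only the standard unicodedata module, you can look at the first stored character and its bidirectional class:

```python
import unicodedata

# The word from the question, spelled with escapes.
corpus = u"\u05e2\u05d1\u05e8\u05d9\u05ea"

first = corpus[0]
print(unicodedata.name(first))           # name of the first stored letter
print(unicodedata.bidirectional(first))  # bidi class; "R" means right-to-left

# Every letter in the word carries the same right-to-left class.
all_rtl = all(unicodedata.bidirectional(ch) == "R" for ch in corpus)
print(all_rtl)
```

A terminal without bidi support prints the characters left to right in stored order, which is why the word can look reversed there.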