python django urllib2 httplib pivotaltracker

How do I post non-ASCII characters using httplib when the content type is "application/xml"?

I've implemented a Pivotal Tracker API module in Python 2.7. The Pivotal Tracker API expects POST data to be an XML document and "application/xml" to be the content type.

My code uses urllib2/httplib to post the document as shown:

    request = urllib2.Request(self.url, xml_request.toxml('utf-8') if xml_request else None, self.headers)
    obj = parse_xml(self.opener.open(request))

This yields an exception when the XML text contains non-ASCII characters:

    File "/usr/lib/python2.7/httplib.py", line 951, in endheaders
      self._send_output(message_body)
    File "/usr/lib/python2.7/httplib.py", line 809, in _send_output
      msg += message_body
    exceptions.UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 89: ordinal not in range(128)

As near as I can see, httplib._send_output is creating an ASCII string for the message payload, presumably because it expects the data to be URL encoded (application/x-www-form-urlencoded). It works fine with application/xml as long as only ASCII characters are used.

Is there a straightforward way to post application/xml data containing non-ASCII characters, or am I going to have to jump through hoops (e.g. using Twisted and a custom producer for the POST payload)?

Solution

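You're mixing Unicode strings and bytestrings. In Python 2, concatenating the two makes the interpreter implicitly decode the bytestring as ASCII, which fails on any byte above 0x7f: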

    >>> msg = u'abc' # Unicode string
    >>> message_body = b'\xc5' # bytestring
    >>> msg += message_body
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)

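httplib builds msg from the request line and the headers, so a single unicode value there turns msg into a Unicode string, and appending the UTF-8 encoded body then triggers the implicit decode. To fix it, make sure the contents of self.headers are properly encoded, i.e., every key and value in the headers should be a bytestring: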

    self.headers = dict((k.encode('ascii') if isinstance(k, unicode) else k,
                         v.encode('ascii') if isinstance(v, unicode) else v)
                        for k, v in self.headers.items())

Note: the character encoding of the headers has nothing to do with the character encoding of the body, i.e., the XML text can be encoded independently (it is just an octet stream from the HTTP message's point of view).

The same goes for self.url: if it has the unicode type, convert it to a bytestring (using the 'ascii' character encoding).

An HTTP message consists of a start-line, headers, an empty line, and possibly a message body. So self.headers is used for the headers; self.url is used for the start-line (which also carries the HTTP method) and probably for the Host header (if the client speaks HTTP/1.1); and the XML text goes into the message body (as a binary blob).
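To make that layout concrete, here is a minimal sketch that assembles such a message by hand. It is not how httplib does it internally; build_http_request, the tracker.example host, and the /example/stories path are hypothetical names used only for illustration.

```python
def build_http_request(method, target, headers, body=b""):
    """Assemble a raw HTTP/1.1 request as one bytestring (illustration only)."""
    lines = [u"{0} {1} HTTP/1.1".format(method, target)]
    lines.extend(u"{0}: {1}".format(name, value) for name, value in headers)
    # The start-line and headers must be pure ASCII; encode() raises otherwise.
    head = u"\r\n".join(lines).encode("ascii")
    # An empty line separates the headers from the opaque message body.
    return head + b"\r\n\r\n" + body


# Hypothetical payload: U+0141 encodes to the bytes 0xC5 0x81 in UTF-8.
xml_body = u"<story><name>\u0141</name></story>".encode("utf-8")
request = build_http_request(
    "POST",
    "/example/stories",  # placeholder path, not a real Pivotal Tracker endpoint
    [("Host", "tracker.example"),
     ("Content-Type", "application/xml"),
     ("Content-Length", len(xml_body))],
    xml_body,
)
```

Because the head is already a bytestring when the body is appended, the non-ASCII bytes pass through untouched and no implicit decoding takes place.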

It is always safe to use the ASCII encoding for self.url (IDNA can be used for non-ASCII domain names, and the result is also ASCII).
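As a rough sketch of what an ASCII-only URL could look like in practice: the helper below (to_ascii_url is a made-up name) applies Python's built-in idna codec to the host and, as an extra assumption that goes beyond the point above, percent-encodes the path as UTF-8. It ignores userinfo and leaves the query string untouched, so treat it as an illustration rather than a complete solution.

```python
try:  # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit
except ImportError:  # Python 2
    from urllib import quote
    from urlparse import urlsplit, urlunsplit


def to_ascii_url(url):
    """Return an ASCII-only bytestring for a (possibly non-ASCII) URL."""
    parts = urlsplit(url)
    # IDNA turns a non-ASCII host name into its ASCII "xn--" form.
    host = parts.hostname.encode("idna").decode("ascii")
    if parts.port:
        host += ":%d" % parts.port
    # Assumption: non-ASCII path characters are percent-encoded as UTF-8.
    path = quote(parts.path.encode("utf-8"), safe="/%")
    ascii_url = urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))
    return ascii_url.encode("ascii")


url = to_ascii_url(u"https://b\u00fccher.example/caf\u00e9")
```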

Here's what RFC 7230 says about the character encoding of HTTP headers:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

To convert XML to a bytestring, see the application/xml encoding considerations:

The use of UTF-8, without a BOM, is RECOMMENDED for all XML MIME entities.
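For instance, assuming the document is built with xml.dom.minidom (which the toxml('utf-8') call in the question suggests, though the question does not say so), serialising with an explicit encoding already yields a UTF-8 bytestring without a BOM. The element names here are placeholders.

```python
from xml.dom import minidom

# Build a tiny document containing non-ASCII text (placeholder element names).
doc = minidom.Document()
story = doc.createElement("story")
name = doc.createElement("name")
name.appendChild(doc.createTextNode(u"\u0141\u00f3d\u017a"))
story.appendChild(name)
doc.appendChild(story)

# Passing an encoding makes toxml() return encoded bytes, not a Unicode string.
body = doc.toxml("utf-8")
```

The resulting body can be passed to urllib2.Request as the data argument as-is.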