python python-3.x character-encoding tornado

Tornado Invalid x-www-form-urlencoded body: 'latin-1' codec can't encode characters in position 774-777: ordinal not in range(256)

I'm using Tornado to accept data sent from clients that I don't have access to. Everything works fine when the data contains only English characters. When the data contains UTF-8 encoded Chinese characters (3 bytes each), Tornado logs this warning and the 'get_argument' function can't retrieve anything at all.

I debugged my code and reduced it to the simplest possible case, yet the warning still comes up:

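import tornado.ioloop
import tornado.web

class DataHandler(tornado.web.RequestHandler):
    def post(self):
        print("test")
        # Fails with HTTP 400 when the form body could not be parsed
        print(self.get_argument("data"))
        print("1")

application = tornado.web.Application([
    (r"/data", DataHandler),  # route must reference the handler class defined above
])

application.listen(5000)
tornado.ioloop.IOLoop.current().start()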

The data's format looks like this:

data={"id":"00f1c423","mac":"11:22:33:44:55:66"}

The data is x-www-form-urlencoded, and Wireshark shows that the Chinese characters are well-formed 3-byte UTF-8 sequences whose first byte starts with E (binary 1110). The position mentioned in the warning (774-777) is where the Chinese characters begin, and it's always 5 bytes, regardless of which Chinese characters are sent.

I'm confused about the 'encode' in the warning. I do nothing about encoding in my own code, so I presume it comes from something Tornado does inside the RequestHandler class. But since Tornado uses the UTF-8 codec by default, where does this latin-1 come from? And most importantly, how can I fix it?

Solution

This is no longer a problem. Tornado was changed to support x-www-form-urlencoded bodies whose values contain raw encoded bytes that have not been percent-encoded into ASCII.
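As for where latin-1 might come from: the sketch below uses only the standard library and is not Tornado's actual code. It assumes, as the warning text suggests, that the parser maps values back to bytes through latin-1, and that affected versions first decoded the body as UTF-8. The helper name parse_form_via_latin1 and its text_decoding parameter are hypothetical, chosen for illustration.

```python
from urllib.parse import parse_qs, quote


def parse_form_via_latin1(body: bytes, text_decoding: str) -> dict:
    """Hypothetical helper: parse a form body and return values as bytes.

    The body is turned into str with `text_decoding`, parsed, and every
    value is mapped back to bytes through latin-1 (one byte per character).
    """
    text = body.decode(text_decoding)
    parsed = parse_qs(text, keep_blank_values=True,
                      encoding="latin-1", errors="strict")
    return {key: [v.encode("latin-1") for v in values]
            for key, values in parsed.items()}


chinese = "\u4e2d\u6587"  # two Chinese characters, 3 bytes each in UTF-8

# What a strict client sends: the UTF-8 bytes are percent-encoded into ASCII.
quoted_body = ("data=" + quote(chinese)).encode("ascii")
# What the clients in the question send: raw UTF-8 bytes in the body.
raw_body = b"data=" + chinese.encode("utf-8")

# Percent-encoded input survives the latin-1 round trip.
print(parse_form_via_latin1(quoted_body, "utf-8")["data"][0].decode("utf-8"))

# Raw bytes decoded as UTF-8 become real Chinese characters, which latin-1
# cannot encode, so the round trip raises UnicodeEncodeError.
try:
    parse_form_via_latin1(raw_body, "utf-8")
except UnicodeEncodeError as exc:
    print("failed:", exc)

# Decoding the body as latin-1 instead keeps every byte intact.
print(parse_form_via_latin1(raw_body, "latin-1")["data"][0].decode("utf-8"))
```

Under these assumptions, the 'encode' in the warning is the step that converts parsed values back to bytes, not anything in the handler code.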

See: tornado merge request

Also: github issue #2733
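If you are stuck on a Tornado version that still logs the warning, one possible workaround is to skip get_argument and parse the raw body yourself. The following is a minimal sketch under that assumption, using only the standard library. The helper name first_form_value and the "name" field in the sample payload are hypothetical; inside a handler you would pass self.request.body, which holds the request body as bytes.

```python
import json
from typing import Optional
from urllib.parse import parse_qs


def first_form_value(body: bytes, field: str) -> Optional[str]:
    """Hypothetical helper: return the first value of `field`, or None.

    Assumes the client sends UTF-8, either raw or percent-encoded.
    """
    parsed = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    values = parsed.get(field)
    return values[0] if values else None


# Sample body shaped like the one in the question, with raw UTF-8 inside.
sample_body = 'data={"id":"00f1c423","name":"\u4e2d\u6587"}'.encode("utf-8")

value = first_form_value(sample_body, "data")
payload = json.loads(value)
print(payload["id"], payload["name"])

# Inside the handler from the question this would look like:
#     value = first_form_value(self.request.body, "data")
```

Note that parse_qs still applies form decoding rules, so a literal '+' in the value becomes a space and '&' or '%' must be percent-encoded by the client.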