Tags: python, utf-8, tornado, utf

Python Tornado: encoding and decoding URLs


I'm building a web server with Tornado. Users can search for keywords and get a reply from the server.

Users can type words in any language, such as Chinese or Japanese, so I know that I should use UTF-8.

Here is my core code:

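import urllib.parse
import tornado.web

class SearchHandler(tornado.web.RequestHandler):
    def get(self, path):
        try:
            # Value of the "key" query argument as returned by Tornado
            print(self.get_argument('key'))
            # The same value after percent-decoding
            print(urllib.parse.unquote(self.get_argument('key')))
            val = urllib.parse.unquote(self.get_argument('key'))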
            ...
            ...

Now let's say that a user searches for the Chinese word 泰国.
The two print calls give the following output:

%E6%B3%B0%E5%9B%BD
泰国

On the backend, I use 泰国.

So far, everything works.
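The round trip between the two printed values can be reproduced with the standard library alone. This is a minimal sketch, not the handler itself; the variable names are only illustrative, and it relies on `urllib.parse.quote` and `unquote` using UTF-8 by default.

```python
from urllib.parse import quote, unquote

word = "泰国"

# quote() encodes the text as UTF-8 and percent-escapes each byte
encoded = quote(word)

# unquote() reverses the escaping and decodes the bytes as UTF-8
decoded = unquote(encoded)

print(encoded)  # %E6%B3%B0%E5%9B%BD
print(decoded)  # 泰国
```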

Today I found some strange text in my log: country-cn.html?æ³°å½content

When I pasted it into my browser, it was displayed exactly as it appears in the log.

However, when I sent the log file to a Windows machine and opened it as a text file, it showed the Chinese word 泰国.

I'm completely confused now. When I use my own machine (macOS) and type 泰国 to query my web server, everything works. But it seems that someone searched for the same Chinese word using an encoding I don't recognize, so I couldn't decode it.
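For reference, the garbled text æ³°å½ is exactly what the UTF-8 bytes of 泰国 look like when they are read as Latin-1. The sketch below only demonstrates that relationship. Whether a viewer reading the log as Latin-1 is what actually happened here is an assumption, although it would fit the Windows text editor showing 泰国 for the same file.

```python
original = "泰国"

# The six UTF-8 bytes of the word: e6 b3 b0 e5 9b bd
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Latin-1 turns every byte into one character.
# Byte 0x9b becomes an invisible control character, so only five
# characters are visible: æ ³ ° å ½
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong step recovers the original text
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

If this is what happened, the bytes in the log are valid UTF-8 and only the program displaying them picked the wrong encoding.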


Solution

  • One possibility is that some browsers fall back to a non-UTF-8 encoding when they can (I'm not sure that is what's happening here, because this is most common with Latin-1 encodings). Adding a hidden input to your form whose value can only be represented in UTF-8 forces the browser to submit the form as UTF-8:

    <input name="utf8" type="hidden" value="&#x2713;" />
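If some clients still send a legacy encoding, a server-side fallback is a possible complement to the hidden input. This is a different technique from the one above and only a rough sketch: `decode_query_value` and `FALLBACK_ENCODINGS` are hypothetical names, and GBK is an assumed example of a legacy Chinese encoding, not something confirmed by the log. It expects the raw, still percent-encoded value.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical fallback order; adjust it to the encodings clients really send
FALLBACK_ENCODINGS = ("utf-8", "gbk")


def decode_query_value(raw: str) -> str:
    """Percent-decode raw, then try each candidate encoding in order."""
    data = unquote_to_bytes(raw)
    for encoding in FALLBACK_ENCODINGS:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what is readable and mark the rest
    return data.decode("utf-8", errors="replace")


# The same word sent as UTF-8 and as GBK decodes to the same text
print(decode_query_value(quote("泰国")))
print(decode_query_value(quote("泰国", encoding="gbk")))
```

Trying encodings in order is a heuristic: byte sequences that happen to be valid in an earlier encoding are never passed on to a later one, so UTF-8 should stay first.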