I'm building a web server with Tornado. Users can search for keywords and get a reply from the server.
Users can type any word, including Chinese or Japanese, so I know that I should use UTF-8.
Here is my core code:
    import urllib.parse

    import tornado.web


    class SearchHandler(tornado.web.RequestHandler):
        def get(self, path):
            try:
                print(self.get_argument('key'))
                print(urllib.parse.unquote(self.get_argument('key')))
                val = urllib.parse.unquote(self.get_argument('key'))
                ...
            ...
Now let's say that a user searched for a Chinese word: 泰国
The two print calls give me the following output:
%E6%B3%B0%E5%9B%BD
泰国
In the backend, I'll use 泰国.
So far everything's fine.
Today I found some strange text in my log:
country-cn.html?æ³°å½content
When I copied it into my browser, it was displayed exactly as it looks above.
However, when I sent the log file to a Windows machine and opened it as a txt file, it showed the Chinese word: 泰国.
I'm totally confused now. When I use my own PC (Mac OS) and type 泰国 to visit my web server, everything's fine. But it seems that someone was trying to search for the same Chinese word using an encoding that I don't recognize, so I couldn't decode it.
One possibility is that some browsers default to a non-UTF-8 encoding when they can (I'm not sure that's what's happening here, because that is most common with latin-1 encodings). Putting a hidden input in your form with a value that can only be represented in UTF-8 will force the browser to use that encoding:
<input name="utf8" type="hidden" value="✓" />
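
It may also help to check whether the odd log text is simply the UTF-8 bytes of 泰国 read as latin-1. The sketch below is only an illustration under that assumption. It starts from the percent-encoded value shown in the question, and the names `quoted`, `good`, `garbled`, and `repaired` are made up for this example.

```python
import urllib.parse

# Percent-encoded value taken from the question's first print output.
quoted = "%E6%B3%B0%E5%9B%BD"

# Decoding the escaped bytes as UTF-8 (the default for unquote)
# yields the intended two Chinese characters.
good = urllib.parse.unquote(quoted)

# Decoding the same bytes as latin-1 maps every byte to one character,
# which gives the mojibake seen in the log. The byte 0x9B becomes an
# invisible control character, so only five characters appear on screen.
garbled = urllib.parse.unquote(quoted, encoding="latin-1")

# Assumption: the log text is this latin-1 misreading. If so, encoding
# it back to latin-1 and decoding as UTF-8 recovers the original word.
repaired = garbled.encode("latin-1").decode("utf-8")
```

If `repaired` equals `good` for the text in your log, the bytes that reached the server were valid UTF-8 and only the tool that displayed them guessed the wrong encoding, which would also explain why the Windows editor showed 泰国.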