From what I've read it seems like a browser must send the x-www-form-urlencoded data in a request in the character set of the form from which the request was generated.
So why do some websites add ?utf8=%E2%9C%93 (that's ?utf8=✓) to their forms? Is this a hack that makes something easier? The character set of that page is already UTF-8 (I checked the headers), so doesn't that guarantee that the browser will send UTF-8? Which browsers don't do this? According to w3schools, all major browsers implement accept-charset on forms:
<form accept-charset="UTF-8">
so why isn't that used instead? Or just nothing at all (since the response specifies UTF-8)?
I did some investigating:
In a UTF-8 page, it appears as though searching for 木 (U+6728) gives:
So it's using percent-encoding, which appears to be a byte-by-byte hex encoding of the text in whatever the underlying character set is. Well, that definitely works, because this place says that's the UTF-8 encoding. That's good, but it's the simple case, where I'm trying to send UTF-8 data to a UTF-8 page.
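To make the byte-by-byte claim concrete, here is a minimal Python sketch (Python is my choice for illustration, it is not something the sites in question use) that percent-encodes 木 and ✓ as UTF-8. The variable names are hypothetical.

```python
from urllib.parse import quote

ch = "\u6728"     # 木
check = "\u2713"  # ✓

# Step 1: the character becomes bytes in the chosen charset (UTF-8 here).
utf8_hex = ch.encode("utf-8").hex()

# Step 2: each byte that is not URL-safe is written as %XX.
encoded_ch = quote(ch, encoding="utf-8")
encoded_check = quote(check, encoding="utf-8")

print(utf8_hex)       # e69ca8
print(encoded_ch)     # %E6%9C%A8
print(encoded_check)  # %E2%9C%93
```

The last line matches the value in ?utf8=%E2%9C%93, so that parameter is simply ✓ encoded as UTF-8.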
Now let's say that I have an ISO-8859-1 page with a form on it. It's a GET form, and I enter the same 木 for a field. Well, that definitely isn't ISO-8859-1. So Chrome encodes it to the numeric character reference &#26408; which is then percent-encoded appropriately to %26%2326408%3B. I verified that IE 8 does the same thing on Windows. So what's the point of the UTF-8 hack?
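The ISO-8859-1 behaviour I observed can be approximated in Python. This is only a sketch of what the browsers appear to do, not their actual implementation: `encode_form_value` is a hypothetical helper, and the `xmlcharrefreplace` error handler stands in for the browser's fallback to a numeric character reference.

```python
from urllib.parse import quote_plus

def encode_form_value(value: str, charset: str) -> str:
    """Encode a form value in `charset`, then percent-encode the bytes.

    Characters the charset cannot represent become &#NNNN; references,
    which mimics the fallback seen in Chrome and IE 8 (assumption).
    """
    raw = value.encode(charset, errors="xmlcharrefreplace")
    return quote_plus(raw)

utf8_result = encode_form_value("\u6728", "utf-8")
latin1_result = encode_form_value("\u6728", "iso-8859-1")
literal_result = encode_form_value("&#26408;", "iso-8859-1")

print(utf8_result)     # %E6%9C%A8
print(latin1_result)   # %26%2326408%3B
print(literal_result)  # %26%2326408%3B
```

One thing the sketch makes visible: under ISO-8859-1, a typed 木 and a typed literal `&#26408;` produce the same bytes, so the server cannot tell them apart from the value alone.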
Related question: Detecting the character encoding of an HTTP POST request
The technique of adding some special characters as hidden data was developed in the old days, when different browsers submitted data in different encodings. It is described e.g. in the document FORM submission and i18n as follows: “the author can add into the form a carefully-crafted "hidden" field which contains a number of diagnostic characters. When this field is submitted, the server can investigate the format of what has been submitted, and reach some conclusions as to what coding the client software was using.”
The technique has lost much of its original relevance, but it is still a cheap way to do some basic correctness checking. It can detect problems, e.g. when someone creates a copy of the form and uses it (due to ignorance, carelessness, or other reasons) to submit data in an encoding other than the intended one.
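As an illustration of that kind of check, here is a minimal Python sketch of what a server could do with the diagnostic field. It is an assumption about how such a check might look, not how any particular framework implements it; the function names and the default field name `utf8` are hypothetical.

```python
from urllib.parse import unquote_to_bytes

# The bytes a UTF-8 submission of the check mark (U+2713) must contain.
CHECK_MARK_UTF8 = "\u2713".encode("utf-8")  # b"\xe2\x9c\x93"

def raw_field_values(query: str, field: str) -> list[bytes]:
    """Return the undecoded byte values of `field` in a raw query string."""
    values = []
    for pair in query.split("&"):
        name, _, value = pair.partition("=")
        if unquote_to_bytes(name.replace("+", " ")) == field.encode("ascii"):
            values.append(unquote_to_bytes(value.replace("+", " ")))
    return values

def looks_like_utf8_submission(query: str, field: str = "utf8") -> bool:
    """True if the diagnostic field arrived as the UTF-8 bytes of the check mark."""
    return CHECK_MARK_UTF8 in raw_field_values(query, field)

print(looks_like_utf8_submission("utf8=%E2%9C%93&q=%E6%9C%A8"))           # True
print(looks_like_utf8_submission("utf8=%26%2310003%3B&q=%26%2326408%3B")) # False
```

The second query is what a copy of the form served as ISO-8859-1 would plausibly send, since ✓ is not representable there and falls back to `&#10003;`. Comparing raw bytes rather than decoded text keeps the check independent of whatever charset the server would otherwise assume.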