
How is character encoding specified in a multipart/form-data HTTP POST request?


The HTML 5 specification describes an algorithm for selecting the character encoding (e.g. UTF-8) to be used in a multipart form submission. However, it is not clear how the selected encoding should be relayed to the server so that the content can be properly decoded on the receiving end.

Often, character encodings are represented by appending a "charset" parameter to the value of the Content-Type request header. However, this parameter does not appear to be defined for the multipart/form-data MIME type:

https://www.rfc-editor.org/rfc/rfc7578#section-8

Each part in a multipart form submission may provide its own Content-Type header; however, RFC 7578 notes that "in practice, many widely deployed implementations do not supply a charset parameter in each part, but rather, they rely on the notion of a 'default charset' for a multipart/form-data instance".

RFC 7578 goes on to suggest that a hidden "_charset_" form field can be used for this purpose. However, neither Safari (9.1) nor Chrome (51) appears to populate this field, and neither provides any per-part encoding information.

I've looked at the request headers produced by both browsers and I don't see any obvious character encoding information. Does anyone know how the browsers are conveying this information to the server?


Solution

  • HTML 5 uses RFC 2388 (obsoleted by RFC 7578). However, HTML 5 explicitly removes the Content-Type header from non-file fields, while the RFCs do not:

    The parts of the generated multipart/form-data resource that correspond to non-file fields must not have a Content-Type header specified. Their names and values must be encoded using the character encoding selected above (field names in particular do not get converted to a 7-bit safe encoding as suggested in RFC 2388).

    The RFCs are designed to allow multipart/form-data to be usable in contexts other than HTML (though HTML is its most common use). In those other contexts, a per-part Content-Type header is allowed; it is disallowed only in HTML 5 (HTML 4 allows it).

    Without a Content-Type header, the hidden _charset_ form field, if present, is the only way an HTML 5 <form> submitter can explicitly state which charset is used.

    Per the HTML 5 algorithm spec that you linked to, the charset MUST be chosen from the <form> element's accept-charset attribute if present; otherwise it is the charset used by the HTML document itself if that charset is ASCII-compatible; otherwise it is UTF-8. This is explicitly stated in the algorithm spec, as well as in RFC 7578 Section 5.1.2 when referring to HTML 5.

    So, there really is no need for the charset to be explicitly stated by a web browser since the receiver of the form submission should know which charset(s) to expect by virtue of how the <form> was created, and thus can check for those charset(s) while parsing the submission. If the receiver wants to know the specific charset used, it needs to include a hidden _charset_ field in the <form>.
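
To make the receiving side concrete, here is a minimal Python sketch of the approach described above: prefer the hidden _charset_ field when the form supplied one, otherwise try the charsets the form was written to accept. It assumes the multipart body has already been split into raw (name, value) byte pairs by whatever parser the server uses. The function name decode_form_fields and its parameters raw_pairs and accepted are hypothetical, and Python codec names only approximate the encoding labels that browsers send.

```python
import codecs


def decode_form_fields(raw_pairs, accepted=("utf-8",)):
    """Decode (name, value) byte pairs from an already-split multipart body.

    raw_pairs and accepted are hypothetical inputs for this sketch: the pairs
    would come from the server's multipart parser, and accepted mirrors the
    charsets listed in the form's accept-charset attribute.
    """
    candidates = []

    # A hidden _charset_ field, if the form included one, names the encoding
    # the browser actually used. Its value is a plain ASCII label.
    declared = next((v for n, v in raw_pairs if n == b"_charset_"), None)
    if declared:
        try:
            label = declared.decode("ascii").strip()
            candidates.append(codecs.lookup(label).name)
        except (ValueError, LookupError):
            pass  # unusable label: fall back to the accepted list

    candidates.extend(accepted)

    # Try each candidate in order; the first one that decodes every name
    # and value without error wins.
    for encoding in candidates:
        try:
            decoded = [(n.decode(encoding), v.decode(encoding))
                       for n, v in raw_pairs]
        except UnicodeDecodeError:
            continue
        return encoding, decoded

    raise ValueError("no candidate charset decodes every field")


# Example: a form that carried a hidden _charset_ field.
pairs = [(b"_charset_", b"windows-1252"), (b"name", "café".encode("cp1252"))]
print(decode_form_fields(pairs))
```

Field names are decoded along with values because HTML 5 encodes both with the selected charset. A single-byte encoding such as ISO-8859-1 accepts any byte sequence, so if it appears in the accepted list it should come last; otherwise it will mask the candidates after it.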