
What is the point of Tomcat's setting URIEncoding?


In Apache Tomcat, the connector attribute URIEncoding tells Tomcat how to decode incoming URIs:

URIEncoding

This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, ISO-8859-1 will be used.

Apache Tomcat 7 - The HTTP Connector

However, as explained for example in "What is the proper way to URL encode Unicode characters?", non-ASCII characters in URIs are always encoded in UTF-8, following current standards (RFC 3986 and 3987).

So:

  • Why is there even a setting for something that is mandated by a standard?
  • Why is the default different from what the standard mandates? (ISO-8859-1 instead of UTF-8)

Is this simply because the Tomcat setting predates the standard, and was retained for backwards compatibility? Or is there some situation where a value different from UTF-8 makes sense?


Solution

  • The description of the URIEncoding parameter in Tomcat 8 (Apache Tomcat 8 - The HTTP Connector):

    This specifies the character encoding used to decode the URI bytes, after %xx decoding the URL. If not specified, UTF-8 will be used unless the org.apache.catalina.STRICT_SERVLET_COMPLIANCE system property is set to true in which case ISO-8859-1 will be used.

    So the description changed between Apache Tomcat 7 and Apache Tomcat 8. Because org.apache.catalina.STRICT_SERVLET_COMPLIANCE defaults to false in Apache Tomcat 8, the default value of URIEncoding there is UTF-8, which means that Tomcat now follows the standard (and common usage).
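
    To make the effect of the default concrete, here is a minimal sketch of what "decode the URI bytes, after %xx decoding" means for the two encodings. It is not Tomcat's code: the class UriDecodingDemo and the method decode are made up for illustration, and java.net.URLDecoder (with the Charset overload, which assumes Java 10 or later) only stands in for the container's decoder. Unlike a real path decoder, URLDecoder also turns '+' into a space.

    ```java
    import java.net.URLDecoder;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Hypothetical demo class, not part of Tomcat.
    public class UriDecodingDemo {

        // Resolves the %xx escapes to bytes, then interprets those bytes
        // in the given charset (the role URIEncoding plays in Tomcat).
        public static String decode(String raw, Charset charset) {
            return URLDecoder.decode(raw, charset);
        }

        public static void main(String[] args) {
            // "e with acute accent" (U+00E9), percent-encoded as its UTF-8 bytes C3 A9
            String raw = "/caf%C3%A9";

            // Same bytes, two interpretations
            System.out.println(decode(raw, StandardCharsets.UTF_8));
            System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        }
    }
    ```

    With UTF-8 the two bytes are read as one character and the result is "/café". With ISO-8859-1 each byte becomes its own character and the result is "/cafÃ©", which is the garbling a client sees when it sends UTF-8 but the server decodes with the old default.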


    As to why Tomcat used ISO 8859-1 as the default URI encoding until Tomcat 7:

    That seems to be because the Tomcat developers believed this to be what the Servlet specification requires (as the name of the setting STRICT_SERVLET_COMPLIANCE indicates).

    As a matter of fact, no version of the Servlet spec before V4.0 explicitly mentions URI encoding. It does, however, state that POST data must be parsed as ISO 8859-1 if the Content-Type HTTP header does not specify an encoding via charset (Servlet Specification V2.5, "Request data encoding"). Apparently this was interpreted to mean that query parameters (and thus the whole URI) should also be decoded as ISO 8859-1 by default.
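
    The POST rule can be sketched as follows. This is only an illustration of the rule as described above, not code from any servlet container: RequestCharsetDemo and bodyCharset are hypothetical names, and the header parsing is deliberately simplified (it does not handle whitespace around "=", for example).

    ```java
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.Locale;

    // Hypothetical demo class, not part of any servlet container.
    public class RequestCharsetDemo {

        // Returns the charset named in a Content-Type header value,
        // or ISO-8859-1 when the header carries no charset parameter.
        public static Charset bodyCharset(String contentType) {
            if (contentType != null) {
                for (String part : contentType.split(";")) {
                    String p = part.trim();
                    if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                        String name = p.substring("charset=".length()).trim();
                        // Strip optional double quotes around the charset name
                        if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                            name = name.substring(1, name.length() - 1);
                        }
                        // Throws an unchecked exception for unknown or malformed names
                        return Charset.forName(name);
                    }
                }
            }
            // Fallback described in the Servlet spec for request bodies
            return StandardCharsets.ISO_8859_1;
        }

        public static void main(String[] args) {
            System.out.println(bodyCharset("application/x-www-form-urlencoded"));
            System.out.println(bodyCharset("application/x-www-form-urlencoded; charset=UTF-8"));
        }
    }
    ```

    Note that a GET request normally has no Content-Type header at all, so a rule like this one, applied to the URI, always ends up at the ISO 8859-1 fallback. That is consistent with the interpretation described above.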

    The root problem is arguably that the Servlet Specification did not specify the default encoding to use for decoding URIs, let alone a way to change this encoding. This in turn is probably because the URI spec originally did not allow non-ASCII characters in URIs; they were only standardized with the introduction of IRIs (see RFC 3987 from January 2005). Therefore every servlet container had to come up with its own default value and configuration parameter, such as URIEncoding in Apache Tomcat.

    These two problems have been reported as bugs against the Servlet Specification.

    For V4.0, the Servlet spec was amended to specify UTF-8 encoding for request URLs:

    12.1 Use of URL Paths

    [...] The request URL is decoded as a UTF-8 encoded string.

    Java™ Servlet Specification, Version 4.0