apache mod-rewrite url-encoding

mod_rewrite does not encode special characters even if NE flag is not supplied?


The Apache documentation describes the NE flag as follows (https://httpd.apache.org/docs/2.2/rewrite/flags.html#flag_ne):

By default, special characters, such as & and ?, for example, will be converted to their hexcode equivalent. Using the [NE] flag prevents that from happening.

RewriteRule ^/anchor/(.+) /bigpage.html#$1 [NE,R]

The above example will redirect /anchor/xyz to /bigpage.html#xyz. Omitting the [NE] will result in the # being converted to its hexcode equivalent, %23, which will then result in a 404 Not Found error condition.
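
To make the difference in that documentation example concrete, here is a minimal sketch using Python's standard urllib.parse. It only parses the two resulting redirect targets; it is not what mod_rewrite runs internally, and the host www.example.com is an assumed placeholder.

```python
from urllib.parse import urlsplit

# With [NE]: the "#" stays literal, so a URI parser sees a fragment.
unescaped = urlsplit("http://www.example.com/bigpage.html#xyz")

# Without [NE]: the "#" has become "%23" and is now just part of the path.
escaped = urlsplit("http://www.example.com/bigpage.html%23xyz")

print(repr(unescaped.path), repr(unescaped.fragment))
print(repr(escaped.path), repr(escaped.fragment))
```

In the second case the client would request a resource literally named bigpage.html#xyz, which is consistent with the 404 the documentation mentions.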

However, I have seen plenty of examples that simply use a RewriteRule like this:

RewriteRule ^(.*)$ http://www.mydomain.com/?foo=bar&jee=lee [L,R]

If you examine the request sent to the server after the redirect, it is the same plain string without any URI encoding. From further experiments, URI encoding seems to happen inside mod_rewrite only when the source request has a special character in its query string, for example originaldomain.com/?foo%5d=6

In that case, when NE is not supplied, mod_rewrite rewrites it to mydomain.com/?foo%255d=6, encoding "%" as "%25". Note that if I omit the "?" from my original request, the encoding does not happen.
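
The "%" to "%25" step I am describing is ordinary double encoding. Below is a rough sketch of it in Python, with urllib.parse.quote standing in for whatever mod_rewrite does internally (that substitution is an assumption, and escape_once is a hypothetical helper name):

```python
from urllib.parse import quote, unquote

def escape_once(text):
    # "%" is not in the safe set, so an existing escape such as "%5d"
    # has its percent sign encoded again as "%25".
    return quote(text, safe="=&")

original = "foo%5d=6"
double_encoded = escape_once(original)

print(double_encoded)
print(unquote(double_encoded))
print(unquote(unquote(double_encoded)))
```

A single decode only gets back to foo%5d=6; it takes a second decode to reach the literal "]", which is why an already-encoded query string looks mangled after passing through an escaping step.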

That leaves me confused about what most sites and the documentation describe, unless I am misunderstanding the concept entirely.

I am also curious about the general rule of thumb that browsers and mod_rewrite use to decide whether to encode a given character. It seems to me that browsers tend not to encode anything unless what was typed cannot sensibly be sent as-is; is that correct? It would also be really helpful if someone could give a complete workflow of when and where encoding and decoding happen, from typing the URL in the browser to the page being rendered.


Solution

  • The general "rule of thumb" and the "complete workflow as to when and where all the encoding and decoding happen" for URIs can be found in RFC 3986:

    The generic syntax uses the slash ("/"), question mark ("?"), and
    number sign ("#") characters to delimit components that are
    significant to the generic parser's hierarchical interpretation of an identifier.

    In short, most browsers treat everything after the # symbol as a fragment identifier, which is resolved on the client and is not sent to the server. For instance, you can link to an element id on a page with:

    http://www.example.com/mypage.html#some_div_id
    

    Because of this, Apache does not expect a # to show up on the server side. By default it therefore URL-encodes the hash symbol (Apache's term is "escaping") before passing it on when you do a rewrite. (In effect it is trying to protect you from yourself, in line with the RFC.)

    The [NE] or noescape flag basically prevents the default url encoding from taking place.

    Also according to the RFC:

    2.2. Reserved Characters

    URIs include components and subcomponents that are delimited by
    characters in the "reserved" set. These characters are called
    "reserved" because they may (or may not) be defined as delimiters by
    the generic syntax, by each scheme-specific syntax, or by the
    implementation-specific syntax of a URI's dereferencing algorithm.
    If data for a URI component would conflict with a reserved
    character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

    Additionally, from section 1.2.3:

    As relative references can only be used within the context of a hierarchical URI, designers of new URI schemes should use a syntax consistent with the generic syntax's hierarchical components unless there are compelling reasons to forbid relative referencing within that scheme.
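
As a small illustration of the section 2.2 rule quoted above (data that collides with a delimiter must be percent-encoded before the URI is formed), here is a sketch using Python's standard urllib.parse. The parameter name foo and the sample value are assumptions borrowed from the question, not anything Apache-specific.

```python
from urllib.parse import parse_qs, quote, urlencode

# A single value that happens to contain the delimiters "&", "=" and "#".
value = "bar&jee=lee#top"

# Encode the data first, then form the query component.
encoded_value = quote(value, safe="")
query = urlencode({"foo": value})

# Parsing the properly encoded query recovers the one original value.
round_trip = parse_qs(query)

# Skipping the encoding lets "&" and "=" act as delimiters instead of data.
naive = parse_qs("foo=" + value)

print(encoded_value)
print(query)
print(round_trip)
print(naive)
```

The unencoded version silently turns one parameter into two, which is the kind of conflict the RFC text is warning about.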