apache tomcat mod-rewrite railo lucee

Lucee URI encoding issue (Cyrillic)


I just moved one of our core apps from Windows+IIS+ColdFusion to Ubuntu+Apache+Lucee. The first big problem is URI encoding for non-Latin alphabets.

For example, requesting the URL http://www.example.com/ru/Солнцезащитные-очки/saint-laurent/ produces this entry in the Apache access.log:

http://www.example.com/ru/%D0%A1%D0%BE%D0%BB%D0%BD%D1%86%D0%B5%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%BD%D1%8B%D0%B5-%D0%BE%D1%87%D0%BA%D0%B8/saint-laurent/

I think that is correctly URL-encoded. I then use a rewrite rule in the .htaccess file to pass that portion of the URL (the Cyrillic one) as a query string parameter (let's say "foo").
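The claim that the logged path is correctly URL-encoded can be checked outside Apache. The following is a minimal Python sketch, not part of the original setup: it percent-encodes the Cyrillic path segment as UTF-8 and compares it with the segment copied from the access.log entry above. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# Path segment as typed in the browser (from the question).
path_segment = "Солнцезащитные-очки"

# The same segment as it appears in the Apache access.log entry.
logged = (
    "%D0%A1%D0%BE%D0%BB%D0%BD%D1%86%D0%B5%D0%B7%D0%B0%D1%89"
    "%D0%B8%D1%82%D0%BD%D1%8B%D0%B5-%D0%BE%D1%87%D0%BA%D0%B8"
)

# quote() percent-encodes the UTF-8 bytes of the string by default.
encoded = quote(path_segment)

# unquote() decodes percent-escapes as UTF-8 by default.
decoded = unquote(logged)

print(encoded == logged)        # the log entry is plain UTF-8 percent-encoding
print(decoded == path_segment)  # and it decodes back to the original text
```

If both comparisons hold, the browser and the access log agree, so the corruption has to happen later, between the rewrite and the application.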

Using cflog to dump it, I see in the application log:

/index.cfm?foo=оÑки-длÑ-зÑениÑ&

...which is obviously wrong, because what I need is the original string, in UTF-8 Cyrillic.

I tried adding the URIEncoding attribute to the Tomcat HTTP connector in server.xml, with no effect:

<Connector port="8888" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />

How can I get my url parameter in UTF-8?


Solution

  • I found the solution by myself.

    Source: http://blogs.warwick.ac.uk/kieranshaw/entry/utf-8_internationalisation_with

    Apache

    Generally you don't need to worry about Apache, as it shouldn't be messing with your HTML or URLs. However, if you are proxying with mod_proxy, you may need to think about this. We use mod_proxy to proxy from Apache through to Tomcat. If you have encoded characters in a URL that you need to convert into a query string for your underlying app, you are going to hit a strange little problem.

    If you have a URL coming into Apache that looks like this:

    http://mydomain/%E4%B8%AD.doc

    and you have a mod_rewrite/proxy rule like this:

    RewriteRule ^/(.*) http://mydomain:8080/filedownload/?filename=$1 [QSA,L,P]

    Unfortunately, the $1 gets mangled during the rewrite. QSA (query string append) handles these characters just fine and sends the existing query string through untouched, but a piece of the URL captured into a backreference, such as $1 here, is mangled: Apache does some unescaping of its own, treating the bytes as ISO-8859-1, and because they are actually UTF-8 the result is wrong. So, to keep our special characters in UTF-8, we escape the captured value again.

    RewriteMap escape int:escape
    RewriteRule ^/(.*) http://mydomain:8080/filedownload/?filename=${escape:$1} [QSA,L,P]

    Take a look at your rewrite logs to see if this is working.

    Really hard to find.
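The effect of re-escaping the captured value can be illustrated outside Apache. The following Python sketch is only a stand-in for the behaviour described in the answer, not Apache's or Tomcat's actual implementation: it takes the example path %E4%B8%AD.doc, unescapes it to raw bytes (as the answer says happens before $1 is captured), and then compares what a consumer sees with and without re-escaping. All variable names are hypothetical.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Example request path from the answer above.
request_path = "%E4%B8%AD.doc"

# Assumption: the backreference holds the unescaped raw bytes.
captured = unquote_to_bytes(request_path)

# Without re-escaping: reading those UTF-8 bytes as ISO-8859-1
# yields mojibake instead of the original character.
mangled = captured.decode("iso-8859-1")

# With re-escaping (the role int:escape plays in the rule above):
# the bytes are turned back into percent-escapes.
escaped = quote(captured)

# A backend that decodes the query string as UTF-8 recovers the text.
filename = unquote(escaped, encoding="utf-8")

print(repr(mangled))
print(escaped)
print(filename)
```

In this sketch the re-escaped value is identical to the original request path, which is why the backend can decode it correctly once it is configured for UTF-8 (for example through the URIEncoding attribute mentioned in the question).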