character-encoding url-encoding orbeon xforms

Specify character encoding for query strings with Orbeon


We're encountering a character encoding issue when reading a UTF-8 query string. A separate external application constructs links to our Orbeon application, such as:

  • http://localhost:8080/ops/encoding-test/?message=hello%20world
  • http://localhost:8080/ops/encoding-test/?message=it%E2%80%99s%20a%20message

Our application's model reads the query string with the oxf:request processor and then displays the string in a view. In the first case above, the application displays "hello world" correctly. In the second case, %E2%80%99 is the URL encoding of a UTF-8 apostrophe, and it causes the application to fail with:

2012-09-13 12:21:43,383 ERROR XSLTTransformer  - Error at line 174 of oxf:/config/theme-examples.xsl:
Illegal HTML character: decimal 128
2012-09-13 12:21:43,384 ERROR ProcessorService  - Exception at line 174 of oxf:/config/theme-examples.xsl
; SystemID: oxf:/config/theme-examples.xsl; Line#: 174; Column#: -1
org.orbeon.saxon.trans.XPathException: Illegal HTML character: decimal 128

The error refers to the %80 (decimal 128), which is the second byte of the multi-byte encoding of the apostrophe. Note that in the log, not only does the theme raise an exception, but the XForms inspector does as well.

The URL appears to be decoded as Latin-1 (ISO-8859-1) instead of UTF-8: the debug processor shows it???s a message, with three characters in place of the apostrophe. In my research so far, it doesn't appear that HTTP has a way to specify the encoding of the query string itself.
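
To illustrate the suspected mis-decoding, here is a minimal Java sketch. It is not Orbeon code: the class and method names are hypothetical, and it only uses the standard java.net.URLDecoder to decode the second test value with both charsets. It assumes the container behaves like a plain ISO-8859-1 decode.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; not part of Orbeon or the servlet API.
public class QueryDecodingDemo {

    // Decodes a percent-encoded query value using the named charset.
    static String decode(String encoded, String charset) throws UnsupportedEncodingException {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "it%E2%80%99s%20a%20message";

        String asUtf8 = decode(raw, "UTF-8");
        String asLatin1 = decode(raw, "ISO-8859-1");

        // UTF-8: the three bytes E2 80 99 become one character, U+2019.
        System.out.println(asUtf8.length());              // 14
        System.out.println((int) asUtf8.charAt(2));       // 8217 (U+2019)

        // ISO-8859-1: each byte becomes its own character.
        System.out.println(asLatin1.length());            // 16
        System.out.println((int) asLatin1.charAt(3));     // 128
    }
}
```

The character with code 128 in the Latin-1 result matches the "Illegal HTML character: decimal 128" message in the log.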

  1. Is there a way to specify the encoding of a query string when read with oxf:request? I didn't see a configuration property for the processor or anything relevant in properties-local.xml that would set a default.
  2. If not, is there a way to force the associated encoding of the string? I suspect this could be done with XSLT, but was unable to find an example. I believe I want something equivalent to Ruby's String#force_encoding.
  3. If not, is there any other suggested way to work around the error? My current worst-case hack-fix here is to just strip out any offending characters using mod_rewrite before it hits the servlet.

Any guidance and assistance is appreciated!

(cross posted to ops-users mailing list at http://mail-archive.ow2.org/ops-users/2012-09/msg00033.html)


Solution

  • Orbeon Forms relies on what the servlet API returns: see getParameterMap() in ServletExternalContext. So the encoding has to be set at the application server level. If you are using Tomcat, add URIEncoding="UTF-8" to the <Connector> element in server.xml.
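
If the connector configuration cannot be changed, the force-encoding idea from question 2 can be approximated in plain Java. The following is only a sketch under the assumption that the container decoded the bytes as ISO-8859-1; the class and method names are hypothetical, and how such a helper would be wired into an Orbeon pipeline is not covered here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Orbeon or the servlet API.
public class ForceEncoding {

    // Recovers the original bytes from a string that was wrongly decoded
    // as ISO-8859-1, then decodes those bytes as UTF-8.
    static String latin1ToUtf8(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the application sees when E2 80 99 is decoded as Latin-1.
        String misdecoded = "it\u00E2\u0080\u0099s a message";

        String repaired = latin1ToUtf8(misdecoded);
        System.out.println(repaired.length());          // 14
        System.out.println((int) repaired.charAt(2));   // 8217 (U+2019)
    }
}
```

This round trip is only safe when every character in the input is in the Latin-1 range and the parameter really was mis-decoded. Applying it to a string that was already decoded correctly would corrupt it, which is why fixing the decoding at the application server is the more robust option.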