How can I enforce utf8 in the struts1 taglib?

I've now got the most wonderful task, the dream of all programmers. There is a roughly 15-year-old piece of software here, and I only have to fix "some bugs" in it. 32-bit Java 6, Tomcat 6, non-Unicode source code, an Ant build system, and everything else I can only "like".

Note: I only have control over the .war file, so server-side settings are not an option.

Solution

Your main problem most likely lies in the <bean:message> tag, although other tags may be problematic as well.

Java has supported Unicode since its earliest days, but unfortunately there is an exception in the handling of .properties files: the InputStream-based JDK API calls always interpret these files as ISO-8859-1.
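The effect is easy to reproduce outside Struts. Below is a minimal sketch, not Struts code: the class and method names (PropertiesEncodingDemo, loadUtf8Bytes) are made up for illustration. It feeds UTF-8 bytes to Properties.load(InputStream) and shows that each accented letter comes back as two ISO-8859-1 characters. It sticks to APIs that should be available on Java 6.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;

public class PropertiesEncodingDemo {

    // Hypothetical helper: loads properties text exactly as the JDK would
    // see it if the file on disk had been saved as UTF-8.
    static Properties loadUtf8Bytes(String text) {
        Properties props = new Properties();
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));
        try {
            // load(InputStream) decodes the bytes as ISO-8859-1.
            props.load(new ByteArrayInputStream(utf8));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // The literal is written with Java escapes so this source file stays pure ASCII.
        String original = "\u00E1rv\u00EDz";
        String loaded = loadUtf8Bytes("greeting=" + original + "\n").getProperty("greeting");

        System.out.println(original.length());       // 5
        System.out.println(loaded.length());         // 7: each accented letter became two chars
        System.out.println(loaded.equals(original)); // false
    }
}
```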

The Struts 1 taglibs use i18n strings addressed by keys and stored in *.properties files. Digging a little into the Struts 1 source, I found the following:

It reads the .properties files with the JDK calls, so always as ISO-8859-1. This is hardwired deep in the code; there is no way to change it.
Struts 1 has a locale (or localeKey) parameter that can be changed through various system properties or web.xml settings, but the .properties files are still always read as ISO-8859-1. The locale/localeKey only adds an extra suffix to the name of the properties file that is actually read.
There is no way around this without forking or duplicating the corresponding part of Struts 1 and bolting some non-standard encoding convention onto the way the properties files are loaded. That is not a very attractive option for relic code like this.

However, Struts and the other parts of your system (for example, the JSP parser/interpreter) already do some conversion as needed, so this ISO-8859-1 text will be converted to UTF-8 on output if your JSP pages are set up correctly (meta headers and so on).

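Furthermore, the property reader has a feature (similarly hardwired, and it cannot be disabled) that gives some Unicode support: it accepts characters as escapes of the form \uC0DE. After a lowercase \u you give a 16-bit hexadecimal value, which can be any UTF-16 code unit (a character outside the Basic Multilingual Plane is written as two such escapes, a surrogate pair). An uppercase \U is not recognized as an escape; the backslash is silently dropped.

The value always has to be exactly four hexadecimal digits; other lengths are not allowed. The hex digits themselves are case insensitive.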

Thus,

my.property.key=árvíztűrő tükörfúrógép

...saved as UTF-8 won't work: the bytes will be interpreted as ISO-8859-1.

Saving this string as ISO-8859-1 can't work either, because some of the accented letters (ű and ő) have no ISO-8859-1 mapping, i.e. they don't exist in that encoding.

However, if you encode it into the format described above:

my.property.key=\u00E1rv\u00EDzt\u0171r\u0151 t\u00FCk\u00F6rf\u00FAr\u00F3g\u00E9p

then yes, it will work!
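If you want to check the escape handling before touching the real files, here is a small sketch. The class name PropertiesEscapeDemo and the helper parseValue are hypothetical, and only standard JDK calls are used. It pushes a few escaped values through Properties.load and prints booleans rather than the characters themselves, so the console encoding can't confuse the result.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;

public class PropertiesEscapeDemo {

    // Hypothetical helper: parses the line "key=<rawValue>" through
    // Properties.load(InputStream) and returns the decoded value.
    static String parseValue(String rawValue) {
        Properties props = new Properties();
        byte[] bytes = ("key=" + rawValue + "\n").getBytes(Charset.forName("ISO-8859-1"));
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty("key");
    }

    public static void main(String[] args) {
        // A doubled backslash in Java source yields one literal backslash in the file text.
        // Escaped letters that have no ISO-8859-1 mapping are decoded correctly.
        System.out.println(parseValue("t\\u0171r\\u0151").equals("t\u0171r\u0151")); // true

        // The four hex digits may be upper or lower case.
        System.out.println(parseValue("\\u00e1").equals(parseValue("\\u00E1")));     // true

        // An uppercase U is not an escape: the backslash is dropped, the rest stays.
        System.out.println(parseValue("\\U00E1"));                                   // U00E1
    }
}
```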

To do this conversion, Sun provided the native2ascii tool. It shipped in the bin directory of the JDK up to Java 8 and was removed in JDK 9, so with a Java 6 JDK you should still have it. If you don't, you have to dig it out of some archive on the net, or find a different tool.

On Linux, there is a tool named uni2ascii (on Debian-based distributions, you can install it with apt-get install uni2ascii) which does the correct conversion. The correct parameters are:

uni2ascii -a U myfile.properties

The result goes to stdout.

How you integrate it into your build system is up to you (some Ant/Maven exec module, or simply run it manually every time the file changes).
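If installing an external tool is not an option, the conversion itself is simple enough to do in a few lines of Java. This is only a sketch under assumptions: the class AsciiEscaper and the method escapeNonAscii are names I made up, and it handles nothing but the non-ASCII characters (it does not touch backslashes or control characters that are already in the text). You would still have to read the source file with its real encoding before passing the text in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;

public class AsciiEscaper {

    // Replaces every char outside 7-bit ASCII with a backslash-u escape of
    // exactly four hex digits. Java chars are UTF-16 code units, so a
    // character outside the BMP automatically becomes two escapes.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append("\\u").append(String.format("%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // The sample line from above, written with Java escapes to keep this file ASCII.
        String value = "\u00E1rv\u00EDzt\u0171r\u0151 t\u00FCk\u00F6rf\u00FAr\u00F3g\u00E9p";
        String escapedLine = escapeNonAscii("my.property.key=" + value);
        System.out.println(escapedLine);

        // Round trip: the escaped line is pure ASCII, so the ISO-8859-1 loader decodes it correctly.
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(
                (escapedLine + "\n").getBytes(Charset.forName("ISO-8859-1"))));
        System.out.println(value.equals(props.getProperty("my.property.key"))); // true
    }
}
```

The first printed line should be the same escaped my.property.key line shown earlier.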