java, unicode, character-encoding, java-native-interface

What does the JNI documentation mean by "Unicode string"?


The JNI documentation refers to "Unicode strings" and "Unicode characters" in a number of places where a specific encoding ought to be named.

The page listing the JNI functions describes several functions as taking or producing "Unicode characters". For example:

NewString

jstring NewString(JNIEnv *env, const jchar *unicodeChars, jsize len);

Constructs a new java.lang.String object from an array of Unicode characters.

I searched the JNI Book for a better description but it left me more confused:

The JNI supports conversion both to and from Unicode and UTF-8 strings. Unicode strings represent characters as 16-bit values [...]

This description confuses me because it suggests that every character is encoded in 16 bits, which isn't enough for Unicode (and it also strangely implies that Unicode and UTF-8 are alternatives). "UTF-16" doesn't appear anywhere in the text of the JNI Book. Maybe the JNI docs were written in a more innocent time when there was only the BMP and 16 bits really was enough?

Since jchar is 16 bits, my guess is that "Unicode" here means UTF-16 but I'm not at all sure.

Update: I noticed the wiki page for UTF-16 says "Unicode" is an old term for what we now know as UCS-2. However, it also says Java now uses UTF-16. From that, I still suspect "Unicode" in the JNI docs means standard UTF-16 but I don't usually work with the JNI or even Java so I'd like someone who feels authoritative to chime in.


Solution

  • From that, I still suspect "Unicode" in the JNI docs means standard UTF-16 but I don't usually work with the JNI or even Java so I'd like someone who feels authoritative to chime in.

    That is what it means.

    The JNI Book and the JNI spec were written a long time ago (1999), well before the use of code points outside of the BMP was commonplace.

    (Unicode 2.0 was released in 1996, and it extended Unicode beyond 16 bits. Java adopted Unicode 2.0 in JDK 1.1. However, it would have taken some time before everyone in the Sun Java team switched to using the new, correct terminology.)
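
To make the "Unicode means UTF-16" reading concrete, here is a minimal Java sketch (not taken from the JNI documentation) showing that a Java char, and therefore a jchar, is a single UTF-16 code unit rather than a whole Unicode character. The class name Utf16Demo and the helper utf16Units are hypothetical names chosen for illustration; Character.toChars and String.codePointCount are standard java.lang methods.

```java
final class Utf16Demo {
    // Hypothetical helper: returns the UTF-16 code units for one code point.
    // These are the 16-bit values a jchar buffer would hold for that character.
    public static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int insideBmp = 0x00E9;    // U+00E9, fits in one 16-bit unit
        int outsideBmp = 0x1F600;  // U+1F600, needs a surrogate pair

        for (int cp : new int[] {insideBmp, outsideBmp}) {
            char[] units = utf16Units(cp);
            String s = new String(units);
            // length() counts 16-bit code units; codePointCount counts characters.
            System.out.printf("U+%04X: %d code unit(s), %d code point(s)%n",
                    cp, s.length(), s.codePointCount(0, s.length()));
        }
    }
}
```

For the code point outside the BMP, the string has a length of two code units but contains one code point. Assuming the answer above is right, this is why the len argument of NewString is best read as a count of 16-bit code units, not of characters.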