Section 3.1 Unicode of the JLS states:
The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.
What does "text" refer to? I'm wondering if it refers to String objects?

That particular sentence is referring to how text data is represented by Java programs; i.e. String and related types.
However, one needs to be careful about reading too much into this.
What it really means is that text data is modeled as a sequence of UTF-16 code units. And it is really about how the JLS deals with those aspects of the Java language that relate to text handling; i.e. how String literals are modeled. The JLS itself doesn't specify the String API. That is specified by the javadocs: a separate document. In reality, the JLS only specifies (or implies) that strings have certain properties.
It is no longer literally correct that Java String objects are represented as UTF-16. Since Java 9 the String class uses a hybrid representation for string values. Strings are now internally represented using a byte[] rather than a char[]. If a string consists solely of LATIN-1 code points, it is encoded with one byte per character. If the string contains any character outside of the LATIN-1 range, the whole string is encoded in UTF-16.
In short, a String is modeled by the javadocs as both a sequence of UTF-16 code units and a sequence of Unicode code points. The internal representation is more complicated.
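As a rough illustration of that dual model, the sketch below looks at one string through both views. The class name TextModelDemo and the sample string are made up for this example; the methods it calls (length, codePointCount, codePointAt) are standard java.lang.String methods.

```java
public class TextModelDemo {
    // "a" followed by U+1F600, a supplementary character that needs
    // two UTF-16 code units (a surrogate pair) in a String.
    static final String SAMPLE = "a\uD83D\uDE00";

    // Number of UTF-16 code units: this is what length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the same string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));   // 3
        System.out.println(codePoints(SAMPLE));  // 2
        // The code point starting at index 1 is read from both surrogates.
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(1))); // 1f600
    }
}
```

Neither view says anything about whether the JVM stores the value as LATIN-1 bytes or UTF-16 bytes internally; that choice is not visible through these methods.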
A Java application can actually choose to model and represent text any way it wants to; i.e. any way that makes sense for the application. It doesn't have to use String or related classes. (Obviously, if an application chooses not to use String and so on, some things are more complicated. For example, Java's String literal syntax yields only String objects, and many other APIs require String values.)
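To make that concrete, here is a minimal sketch of an application keeping its text in forms other than String: as an int[] of code points, or as UTF-8 bytes. The class name CustomTextDemo and its helper methods are hypothetical and exist only for this example; the conversions rely on standard library calls. Note that the text still starts life as a String literal, which is the complication mentioned above.

```java
import java.nio.charset.StandardCharsets;

public class CustomTextDemo {
    // Hypothetical choice 1: hold text as an array of Unicode code points.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Convert back when an API insists on a String.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // Hypothetical choice 2: hold text as UTF-8 encoded bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String literal = "a\uD83D\uDE00";       // "a" plus U+1F600
        int[] cps = toCodePoints(literal);
        byte[] utf8 = toUtf8(literal);
        System.out.println(cps.length);          // 2 code points
        System.out.println(utf8.length);         // 5 bytes (1 + 4)
        System.out.println(fromCodePoints(cps).equals(literal)); // true
    }
}
```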
If you take those caveats together, the particular sentence we are talking about is best viewed as explanatory rather than prescriptive.
The Java compiler represents (Java source code) text internally the same way as most other Java programs do; i.e. using String and related types. However, that is an implementation detail. So long as Unicode in Java source code is supported properly by the compiler, it doesn't matter how it is represented at compile time.
("Supported properly" means in accordance with whatever the JLS specifies.)