What does "text" in JLS 3.1 Unicode refer to?


Section 3.1 Unicode of the JLS states:

The Java programming language represents text in sequences of 16-bit code units, using the UTF-16 encoding.

What does "text" refer to?

I'm wondering if it refers to

  • content stored in String objects?
  • the source code as a whole which is passed to the compiler, meaning that this is an instruction for the compiler that the first thing it has to do is to convert the source code internally to UTF-16 before processing it further?

Solution

  • That particular sentence is referring to how text data is represented by Java programs; i.e. String and related types.

    However, one needs to be careful about reading too much into this.

    1. What it really means is that text data is modeled as a sequence of UTF-16 code units. It is about how the JLS deals with those aspects of the Java language that relate to text handling; e.g. how String literals are modeled. The JLS itself doesn't specify the String API; that is specified by the javadocs, which are a separate document. The JLS only specifies (or implies) that strings have certain properties.

    2. It is no longer literally correct that Java String objects are represented as UTF-16. Since Java 9 (compact strings, JEP 254), the String class uses a hybrid representation for string values. Strings are now internally stored in a byte[] rather than a char[]. If a string consists solely of LATIN-1 characters, it is encoded with one byte per character. If the string contains any character outside of the LATIN-1 range, the whole string is encoded in UTF-16.

      In short, a String is modeled by the javadocs as both a sequence of UTF-16 code-units and a sequence of Unicode code-points. The internal representation is more complicated.

    3. A Java application can actually choose to model and represent text any way it wants to; i.e. any way that makes sense for the application. It doesn't have to use String or related classes. (Obviously, if an application chooses not to use String and so on, some things become more complicated. For example, Java's String literal syntax yields only String objects, and many other APIs require String values.)

    If you take those caveats together, the particular sentence we are talking about is best viewed as explanatory rather than prescriptive.
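To make the code-unit versus code-point distinction concrete, here is a minimal sketch. The class and method names (`TextModelDemo`, `codeUnits`, `codePoints`) are made up for illustration; only the standard `String` and `Character` methods are real API. It shows only what the public API exposes, not the internal byte[] representation, which cannot be observed this way.

```java
class TextModelDemo {
    // Number of UTF-16 code units (char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which lies outside the Basic Multilingual
        // Plane and therefore needs two UTF-16 code units (a surrogate pair).
        String s = "a" + new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(s));                           // 3
        System.out.println(codePoints(s));                          // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1f600
    }
}
```

The same String answers both questions: `length()` and `charAt()` work in code units, while `codePointCount()` and `codePointAt()` work in code points.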


    The Java compiler represents (Java source code) text internally the same way as most other Java programs do; i.e. using String and related types. However, that is an implementation detail. So long as Unicode in Java source code is supported properly by the compiler, it doesn't matter how it is represented at compile time.

    ("Supported properly" means in accordance with whatever the JLS specifies.)
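One observable consequence of that requirement is that a Unicode escape in source code (JLS 3.3) denotes the same character no matter how the compiler stores the source internally. A small sketch, assuming a hypothetical class `UnicodeEscapeDemo` with a helper `fromCodePoint`; the source is kept pure ASCII so the result does not depend on the file encoding passed to the compiler:

```java
class UnicodeEscapeDemo {
    // Builds a one-code-point string at run time, independent of source encoding.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // The escape in the literal is translated by the compiler to U+00E9.
        String escaped = "caf\u00e9";
        // The same text, assembled from the numeric code point at run time.
        String built = "caf" + fromCodePoint(0xE9);

        System.out.println(escaped.equals(built)); // true
        System.out.println(escaped.length());      // 4
    }
}
```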