java unicode character-encoding javac

How are programs written in Unicode?


From The Java Language Specification, Java SE 7 Edition:

§3.1 Unicode

Programs are written using the Unicode character set.

§3.2 Lexical Translations

A raw Unicode character stream is translated into a sequence of tokens, using the following three lexical translation steps...

I'm confused because I write my source code in my native character encoding (Windows-1252), yet the specification seems to say that everything begins from a raw Unicode character stream, on which the lexical translations (Unicode escape conversion included) are then performed.

The specification mentions that Unicode escapes can be used to include any Unicode character using only ASCII characters. If a previous conversion is performed, I assume this refers to the ASCII characters as a subset of the Unicode character set, which makes sense.

Is there a previous conversion from the encoding used to write the source file to Unicode?

Here is some related information, although I think it is more about text handling at runtime than about the compilation process:

Converting Non-Unicode Text


Solution

  • Basically, what the spec is saying is that you can only use Unicode characters in your source files. It doesn't define how those characters are actually encoded into bytes; that's up to you and the platform you're working on.

    Inside the compiler, a source file is read from disk as a stream of bytes, and those bytes are then converted to Java's internal representation of Unicode characters. How the raw bytes of the source file are translated to Unicode characters depends on the -encoding option passed to javac. If no -encoding option is set, javac uses your platform's default encoding.

    It's also important to note that after the compiler translates the source code bytes into characters, it performs another step: it converts Unicode escapes (e.g. the \u00a5 in \u00a5123, which becomes ¥123) into the corresponding Unicode characters. This is actually the first of the three steps referenced in section 3.2 that you quoted in your question. This way, it's possible to represent any Unicode character in your source using nothing but plain ASCII characters.
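
The two stages described above (bytes decoded with some charset, then Unicode escapes translated) can be illustrated with a small sketch. This is not how javac is implemented; it only mimics the byte-to-character decoding with `new String(bytes, charset)`. The class name `SourceEncodingDemo` and the helpers `decode` and `codePoints` are made up for illustration, and the code assumes the `windows-1252` charset is available in your JDK (it is not one of the charsets every JVM is required to support, although common JDKs ship it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingDemo {

    // Decode raw bytes into Java's internal (UTF-16) string form using the
    // given charset, roughly what javac does with the bytes of a source file.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render each UTF-16 code unit as U+XXXX so the output does not depend
    // on the console's own encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: this charset is present in the running JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        // The yen sign is one byte (0xA5) in Windows-1252 ...
        byte[] yenCp1252 = { (byte) 0xA5 };
        // ... and two bytes (0xC2 0xA5) in UTF-8.
        byte[] yenUtf8 = { (byte) 0xC2, (byte) 0xA5 };

        // Right charset for each byte sequence: both yield U+00A5.
        System.out.println(codePoints(decode(yenCp1252, cp1252)));
        System.out.println(codePoints(decode(yenUtf8, StandardCharsets.UTF_8)));

        // Wrong charset: the UTF-8 bytes are misread as two characters.
        System.out.println(codePoints(decode(yenUtf8, cp1252)));

        // A Unicode escape is pure ASCII in the file, so it reads the same
        // under any ASCII-compatible encoding; the compiler turns it into
        // U+00A5 before tokenizing.
        String price = "\u00a5123";
        System.out.println(codePoints(price));
    }
}
```

Because the file itself contains only ASCII characters, it should compile identically with `javac -encoding windows-1252 SourceEncodingDemo.java` or `javac -encoding UTF-8 SourceEncodingDemo.java`. A file containing a literal ¥ would not have that property, which is the situation the -encoding option exists for.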