java unicode char codepoint

What is the proper way to get a char's code point?


I need to work with code points and handle a carriage return specially. I have a function that takes a char's code point, and if it is \r it needs to behave differently. I've got this:

if (codePoint == Character.codePointAt(new char[] {'\r'}, 0)) {

but that is very ugly and certainly not the right way to do it. What is the correct method of doing this?

(I know I could hardcode the number 13, the decimal value of \r, but that would make it unclear what I am doing.)


Solution

  • If you know that all your input is going to be in the Basic Multilingual Plane (U+0000 to U+FFFF), then you can just use:

    char character = 'x';
    int codePoint = character;
    

    That uses the implicit conversion from char to int, as specified in JLS 5.1.2:

    19 specific conversions on primitive types are called the widening primitive conversions:

    • ...
    • char to int, long, float, or double

    ...

    A widening conversion of a char to an integral type T zero-extends the representation of the char value to fill the wider format.
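
Because of that widening rule, a char literal can be compared with an int code point directly, with no array and no hardcoded 13. The following is a minimal sketch of that idea; the class name `CarriageReturnCheck` and the helper `isCarriageReturn` are hypothetical names chosen for illustration, not part of the original question.

```java
final class CarriageReturnCheck {
    // Hypothetical helper: reports whether a code point is a carriage return.
    static boolean isCarriageReturn(int codePoint) {
        // '\r' is a char constant; it is widened to int (13) for the comparison.
        return codePoint == '\r';
    }

    public static void main(String[] args) {
        int fromChar = '\r';                  // implicit char-to-int widening
        int fromApi = "\r\n".codePointAt(0);  // same value obtained via the String API

        System.out.println(isCarriageReturn(fromChar)); // true
        System.out.println(isCarriageReturn(fromApi));  // true
        System.out.println(isCarriageReturn('\n'));     // false
    }
}
```

This works for \r because it lies in the BMP, so its code point and its UTF-16 code unit have the same value.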

    However, a char is only a UTF-16 code unit. The point of Character.codePointAt is that it copes with code points outside the BMP, which are represented by a surrogate pair: two UTF-16 code units that together encode a single code point.

    From JLS 3.1:

    The Unicode standard was originally designed as a fixed-width 16-bit character encoding. It has since been changed to allow for characters whose representation requires more than 16 bits. The range of legal code points is now U+0000 to U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are greater than U+FFFF are called supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as pairs of 16-bit code units, the first from the high-surrogates range, (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points and UTF-16 code units are the same.
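
To make the surrogate ranges in that passage concrete, here is a small sketch that encodes one supplementary code point into its two UTF-16 code units and decodes it again. U+1F600 is picked purely as an example, and the class name `SurrogatePairDemo` and method `encode` are hypothetical.

```java
final class SurrogatePairDemo {
    // U+1F600 lies outside the BMP; it is used here only as an example value.
    static final int SUPPLEMENTARY = 0x1F600;

    // Hypothetical helper: returns the UTF-16 code units for a code point.
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        char[] units = encode(SUPPLEMENTARY);

        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // Widening a single surrogate yields the code unit, not the code point.
        int widened = units[0];
        System.out.println(widened == SUPPLEMENTARY);            // false

        // Combining both code units recovers the original code point.
        int decoded = Character.toCodePoint(units[0], units[1]);
        System.out.println(decoded == SUPPLEMENTARY);            // true
    }
}
```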

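    If your input may contain supplementary characters, you'll need the more complicated code, such as Character.codePointAt or String.codePointAt, to read each code point. For the comparison in the question, though, \r is in the BMP, so codePoint == '\r' is sufficient however the code point was obtained.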
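
As a rough sketch of that more complicated case, the loop below walks a String one code point at a time, so a surrogate pair is never split, and still compares against '\r' directly. The class name `CodePointScan` and the helper `countCarriageReturns` are hypothetical, and the sample text is made up for illustration.

```java
final class CodePointScan {
    // Hypothetical helper: counts carriage returns, visiting whole code points.
    static int countCarriageReturns(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i); // reads a surrogate pair as one value
            if (codePoint == '\r') {
                count++;
            }
            i += Character.charCount(codePoint); // advance by 1 or 2 code units
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample input: BMP characters plus one supplementary character (U+1F600).
        String text = "a\r\n" + new String(Character.toChars(0x1F600)) + "\r";

        System.out.println(countCarriageReturns(text));            // 2
        System.out.println(text.length());                         // 6 code units
        System.out.println(text.codePointCount(0, text.length())); // 5 code points
    }
}
```

On Java 8 and later, text.codePoints() offers the same traversal as an IntStream, which may read more cleanly than the explicit loop.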