The Unicode code point of the character '🤣' is U+1F923. When it is pasted into Java source code in IntelliJ IDEA, it is automatically converted to the escape sequence \uD83E\uDD23.
Java's char type is a 16-bit UTF-16 code unit (originally UCS-2), so a code point above U+FFFF has to be transformed into a pair of 16-bit values, a surrogate pair.
I want to know the logic of this transformation, but couldn't find any material about it.
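A quick check (a minimal sketch; the class name is arbitrary) confirms that what IDEA inserts is a single code point stored as two 16-bit code units:

public class SurrogateCheck {
    public static void main(String[] args) {
        String s = "\uD83E\uDD23"; // what IDEA inserts for '🤣'

        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(s.length());                      // 2 char code units

        // The two code units are the surrogate pair:
        System.out.println(String.format("[%x][%x]", (int) s.charAt(0), (int) s.charAt(1))); // [d83e][dd23]

        // Together they still decode to the original code point:
        System.out.println(String.format("%x", s.codePointAt(0))); // 1f923
    }
}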
The transformation is the UTF-16 encoding of code points in the range U+010000 to U+10FFFF, described at https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF:
- 0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.
- The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
- The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.
Now apply this to the input code point U+1F923: subtracting 0x10000 leaves 0xF923; the high ten bits are 0x3E, so the high surrogate is 0xD800 + 0x3E = 0xD83E; the low ten bits are 0x123, so the low surrogate is 0xDC00 + 0x123 = 0xDD23.
The same steps in code:
public static void main(String[] args) {
    int input = 0x1f923;                  // code point U+1F923
    int x = input - 0x10000;              // subtract 0x10000 -> 20-bit value 0xF923
    int highTenBits = x >> 10;            // top ten bits: 0x3E
    int lowTenBits = x & ((1 << 10) - 1); // bottom ten bits: 0x123
    int high = highTenBits + 0xd800;      // high surrogate: 0xD83E
    int low = lowTenBits + 0xdc00;        // low surrogate: 0xDD23
    System.out.println(String.format("[%x][%x]", high, low)); // prints [d83e][dd23]
}
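For a cross-check, the JDK already implements this transformation in java.lang.Character; the sketch below (class name arbitrary) uses Character.highSurrogate, Character.lowSurrogate, and Character.toChars to reproduce the same pair:

public class JdkSurrogates {
    public static void main(String[] args) {
        int codePoint = 0x1f923;

        // The standard library applies the same formula internally:
        char high = Character.highSurrogate(codePoint); // 0xD83E
        char low = Character.lowSurrogate(codePoint);   // 0xDD23
        System.out.println(String.format("[%x][%x]", (int) high, (int) low)); // [d83e][dd23]

        // Character.toChars returns both code units at once:
        char[] units = Character.toChars(codePoint);
        System.out.println(new String(units)); // prints 🤣
    }
}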