
Convert UCS-4 to UCS-2


The Unicode value of the UCS-4 character '🤣' is 0001F923 (U+1F923). It is automatically changed to the corresponding surrogate pair \uD83E\uDD23 when copied into Java code in IntelliJ IDEA.

Java strings only support UCS-2 (strictly speaking, UTF-16) code units, so a transformation from UCS-4 to UCS-2 has to occur.

I want to know the logic of the transformation, but I couldn't find any material about it.
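
For reference, the two resulting code units can be inspected directly in Java; a minimal sketch (the class name is mine):

    public class SurrogateCheck {
        public static void main(String[] args) {
            String s = "🤣";
            System.out.println(s.length());        // 2: two UTF-16 code units
            System.out.printf("\\u%04X\\u%04X%n",
                    (int) s.charAt(0), (int) s.charAt(1));  // prints \uD83E\uDD23
            System.out.printf("U+%X%n", s.codePointAt(0));  // U+1F923
        }
    }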


Solution

  • https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF

    U+010000 to U+10FFFF

    • 0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.
    • The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
    • The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.

    Now with input code point U+1F923:

    • 0x1F923 - 0x10000 = 0xF923
    • 0xF923 = 1111100100100011, padded to 20 bits: 00001111100100100011 = [0000111110][0100100011] = [0x3E][0x123]
    • 0xD800 + 0x3E = 0xD83E
    • 0xDC00 + 0x123 = 0xDD23
    • The result: \uD83E\uDD23

    Programming:

    public static void main(String[] args) {
        int input = 0x1f923;               // code point U+1F923
        int x = input - 0x10000;           // subtract 0x10000, leaving a 20-bit value

        int highTenBits = x >> 10;             // upper 10 bits
        int lowTenBits = x & ((1 << 10) - 1);  // lower 10 bits

        int high = highTenBits + 0xd800;   // high surrogate: 0xD83E
        int low = lowTenBits + 0xdc00;     // low surrogate: 0xDD23

        System.out.printf("[%x][%x]%n", high, low);  // prints [d83e][dd23]
    }
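
    For a cross-check, the JDK exposes the same transformation (and its inverse) through the standard Character API; a minimal sketch:

    public class SurrogatePair {
        public static void main(String[] args) {
            int codePoint = 0x1f923;

            // Encode: the JDK computes the same surrogate pair
            char high = Character.highSurrogate(codePoint);   // 0xD83E
            char low = Character.lowSurrogate(codePoint);     // 0xDD23
            System.out.printf("[%x][%x]%n", (int) high, (int) low);

            // Character.toChars produces the pair as a char array
            System.out.println(new String(Character.toChars(codePoint)));  // 🤣

            // Decode: reverse the transformation
            System.out.printf("U+%X%n", Character.toCodePoint(high, low)); // U+1F923
        }
    }

    Character.toCodePoint applies the inverse formula: codePoint = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00).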