Tags: c, memory, binary, wchar-t

In C, what would happen if I put 'successive wchar_t characters' into a wchar_t variable?


#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t wc = L' 459';
    printf("%d", (int)wc);      //result : 32
}

I know that the space character is decimal 32 in the ASCII table.

What I don't understand is this: as far as I know, if a variable doesn't have enough room to store a value, it keeps only the 'last digits' of the original value.

For example, if I put the binary value '1100 1001 0011 0110' into a single-byte variable, it becomes '0011 0110', which is 'the last byte' of the original binary value.

But the code above shows 'the first byte' of the original value.

I'd like to know what happens at the memory level when I execute the code above.


Solution

  • unsigned long long x = 0x0041004200430044ULL;
    printf("%016llx\n", x);             //prints 0041004200430044
    
    wchar_t wc;
    wc = (wchar_t)x;                    //keeps only the low 16 bits on Windows
    printf("%04X\n", (unsigned)wc);     //prints 0044 as you expect
    
    wc = L'\x0041\x0042\x0043\x0044';   //MSVC keeps the first character
    printf("%04X\n", (unsigned)wc);     //prints 0041
    

    If you assign an integer value that's too large, the value is truncated: only the low-order bits that fit in 2 bytes are kept, here 0x0044.

    If you try to put several characters into one wide character constant, the result is implementation-defined; the Microsoft compiler takes the first character, 0x0041, which fits. L'x' is meant to hold a single wide character.


    VS2019 will issue a warning for wchar_t wc = L' 459' unless the warning level is set below 3, which is not recommended. Use warning level 3 or higher.

    In C++, wchar_t is a built-in type rather than a typedef for unsigned short; in C it is a typedef provided by <stddef.h> and <wchar.h>. Either way it is 2 bytes on Windows and typically 4 bytes on Linux.

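    Note that 'abcd' packs 4 bytes of character data (it is a multi-character constant of type int). The L prefix means 2 bytes per element (on Windows), so L'abcd' describes 8 bytes of data, which cannot fit in a single 2-byte wchar_t.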

    To see what is inside wc, let's look at the Unicode character L'X', whose UTF-16 encoding is 0x0058 (code points below 128 have the same values as ASCII).

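    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    
    int main(void)
    {
        wchar_t wc = L'X';
        printf("%lc\n", (wint_t)wc);     // print the character itself
        unsigned char buf[sizeof wc];
        memcpy(buf, &wc, sizeof wc);     // copy the raw bytes of wc
        for (size_t i = 0; i < sizeof wc; i++)
            printf("%02X ", buf[i]);     // dump each byte in memory order
        printf("\n");
        return 0;
    }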
    

    On Windows the byte dump will be 58 00. It is not 00 58 because Windows runs on little-endian systems, which store the least significant byte first.

    Another quirk is that UTF-16 uses 4 bytes (a surrogate pair of two 16-bit code units) for some code points, so you will get a warning for this line:

    wchar_t wc = L'😀';
    

    Instead you want to use string:

    const wchar_t *wstr = L"😀";
    MessageBoxW(NULL, wstr, NULL, MB_OK); //needs <windows.h>; console may not display this correctly
    

    On Windows this string occupies 6 bytes: 3 elements of 2 bytes each (the two code units of the surrogate pair plus the terminating null character).