Tags: c, linux, gcc, buildroot

Getting wrong UTF-8 values by casting char into USHORT


This is my first question here, so feel free to criticize or correct me if I am missing important rules.

Recently I was tasked with porting old DOS C code to a Linux platform. Font handling is implemented with bitmap fonts. I wrote a function that draws the selected glyph if you pass the correct Unicode value into it.

However, if I cast the char to a USHORT (the function expects this type), I get the wrong value whenever the character is outside the ASCII table.

char* test;
test = "°";

printf("test: %hu\n",(USHORT)test[0]);

The displayed number (console) should be 176 but is instead 194.

If you use "!" the correct value of 33 will be displayed. I made sure that char is unsigned by setting the GCC compiler flag

-funsigned-char

The GCC compiler uses UTF-8 encoding as the default. I really don't know where the issue is right now.

Do I need to add another flag to the compiler?

Update

With the help of @Kninnug's answer, I managed to write code that produces the desired results for me.

#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <stdint.h>

int main(void)
{
   size_t n = 0, x = 0;
   setlocale(LC_CTYPE, "en_US.utf8");
   mbstate_t state = {0};
   char in[] = "!°水"; // or u8"zß水"
   size_t in_sz = sizeof(in) / sizeof (*in);

   printf("Processing %zu UTF-8 code units: [ ", in_sz);
   for(n = 0; n < in_sz; ++n)
   {
      printf("%#x ", (unsigned char)in[n]);
   }
   puts("]");

   wchar_t out[in_sz];
   char* p_in = in, *end = in + in_sz;
   wchar_t *p_out = out;
   int rc = 0;
   while((rc = mbrtowc(p_out, p_in, end - p_in, &state)) > 0)
   {
       p_in += rc;
       p_out += 1;
   }

   size_t out_sz = p_out - out + 1;
   printf("into %zu wchar_t units: [ ", out_sz);
   for(x = 0; x < out_sz; ++x)
   {
      printf("%u ", (unsigned short)out[x]);
   }
   puts("]");
}

However, when I run this on my embedded device, the non-ASCII characters get merged into one wchar, not into two like on my computer.

I could use a single-byte encoding such as CP1252 (this worked fine), but I would like to keep using Unicode.


Solution

  • A char (signed or unsigned) is a single byte in C [1]. (USHORT)test[0] casts only the first byte of test, but the character stored there occupies 2 bytes in the UTF-8 encoding (you can check that with strlen, which counts the number of bytes before the first 0-byte). "°" is U+00B0, which UTF-8 encodes as the two bytes 0xC2 0xB0, and 0xC2 is the 194 you are seeing.
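
    To make the byte-level view concrete, here is a minimal sketch that prints each byte of a string as an unsigned number. The helper name print_bytes is made up for this illustration, and the string is meant to be passed with escapes ("\xC2\xB0") so the result does not depend on the encoding of the source file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print every byte of s as an unsigned decimal
 * value and return the number of bytes (same count as strlen). */
size_t print_bytes(const char *s)
{
    size_t len = strlen(s);
    for (size_t i = 0; i < len; ++i)
        printf("byte %zu: %u\n", i, (unsigned)(unsigned char)s[i]);
    return len;
}
```

    Calling print_bytes("\xC2\xB0") returns 2 and prints 194 followed by 176. The 194 is the lead byte that the cast in the question picked up; the code point 176 only appears once both bytes are decoded together.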

    To get the proper code point you need to decode the entire UTF-8 sequence. You can do this with mbrtowc and related functions:

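    const char *test = "°";
    size_t len = strlen(test);

    wchar_t code = 0;
    mbstate_t state = {0};

    // mbrtowc decodes according to the current LC_CTYPE locale, so a UTF-8
    // locale must have been selected first, e.g. with
    // setlocale(LC_CTYPE, "en_US.utf8") as in the question's update

    // convert up to len bytes in test, and put the result in code
    // state is used when there are incomplete sequences: pass it to
    // the next call to continue decoding
    size_t rc = mbrtowc(&code, test, len, &state);
    if (rc == (size_t)-1 || rc == (size_t)-2) {
        // invalid or incomplete sequence: code was not set
    }

    // here the cast is needed, since a wchar_t is not (necessarily) a short
    printf("test: %hu\n", (USHORT)code);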
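
    Because mbrtowc works through the current LC_CTYPE locale, it is worth checking both the setlocale result and the mbrtowc return value. The sketch below wraps the two steps in small helpers; decode_first and use_locale are hypothetical names for this example, and the locale name "en_US.utf8" is simply the one used in the question's update. It also assumes that wchar_t holds Unicode code points, which is true with glibc but not guaranteed by the C standard.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try to select the named LC_CTYPE locale.
 * Returns 1 on success, 0 if the locale is not available. */
int use_locale(const char *name)
{
    return setlocale(LC_CTYPE, name) != NULL;
}

/* Hypothetical helper: decode the first multibyte character of s
 * using the current LC_CTYPE locale. Returns the wchar_t value as a
 * long, or -1 if s is empty or the sequence is invalid or incomplete. */
long decode_first(const char *s)
{
    wchar_t wc = 0;
    mbstate_t state;
    memset(&state, 0, sizeof state);

    size_t rc = mbrtowc(&wc, s, strlen(s), &state);
    if (rc == (size_t)-1 || rc == (size_t)-2)
        return -1; /* invalid or incomplete for this locale */
    return (long)wc;
}
```

    A caller would first test use_locale("en_US.utf8") and only then call decode_first. If setlocale returns NULL, the program stays in the "C" locale and multi-byte input is not decoded as UTF-8. On a minimal embedded root filesystem the requested locale may simply not be installed, which is one possible (unconfirmed) explanation for results that differ from a desktop machine.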
    

    Side notes:

    • If USHORT is 16 bits (as is commonly the case), it cannot hold every Unicode code point: the full range runs up to U+10FFFF, which needs 21 bits.

    • When you have obtained the proper code point, the cast should not be necessary to pass it to the drawing function. If the function definition or prototype is visible, the compiler can convert the value by itself.
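
    To illustrate the first side note, and as a locale-independent alternative to mbrtowc, here is a sketch of a hand-written UTF-8 decoder that returns the code point in a 32-bit integer. This is not the approach used in the answer above; utf8_decode is a hypothetical helper written for this example, not a library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (at most len bytes) into *cp.
 * Returns the number of bytes consumed, or 0 if the input is empty,
 * truncated, or not well-formed UTF-8. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* Smallest code point allowed for each sequence length;
     * anything below it is an overlong encoding. */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;

    /* The lead byte determines the sequence length. */
    if (s[0] < 0x80)                { n = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else
        return 0; /* stray continuation byte or invalid lead byte */

    if (len < n)
        return 0; /* sequence is cut off */

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (i = 1; i < n; ++i) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return n;
}
```

    With this helper, the bytes 0xC2 0xB0 decode to 176 and 0xE6 0xB0 0xB4 ("水") decode to 0x6C34, both of which fit in 16 bits. The four-byte sequence 0xF0 0x9F 0x98 0x80 decodes to 0x1F600; casting that value to a 16-bit type leaves 0xF600, a different code point, which is why a 16-bit USHORT only works for characters in the Basic Multilingual Plane.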


    [1] The confusing name comes from the time when all the world's English and all the ASCII code points could fit in a single byte. Hence, a character was the same as a byte.