c, string, unicode, ncurses, widestring

Are there other ways to specify or enter a Unicode code point in C other than using string literals?


In the following program I am trying to pass a Unicode code point to the ncurses function setcchar() as a wide-character array instead of as a string literal. However, the output I get is only the first character of the array, namely the backslash.

Is there another way to specify a Unicode code point other than as a string literal? And why do the two expressions L"\u4e09" and wcsarr not produce the same result in this context?

#define _XOPEN_SOURCE_EXTENDED 1
#include <curses.h>
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
#include <time.h>

int main() {
  setlocale(LC_ALL, "");
  cchar_t kanji;
  wchar_t wcsarr[7];

  wcsarr[0] = L'\\';
  wcsarr[1] = L'u';
  wcsarr[2] = L'4';
  wcsarr[3] = L'e';
  wcsarr[4] = L'0';
  wcsarr[5] = L'9';
  wcsarr[6] = L'\0';

  initscr();

  setcchar(&kanji, wcsarr, WA_NORMAL, 5, NULL);
  addstr("Code point entered as an array string: ");
  add_wch(&kanji);
  addstr("\n");

  setcchar(&kanji, L"\u4e09", WA_NORMAL, 5, NULL);
  addstr("Code point entered as a string literal: ");
  add_wch(&kanji);
  addstr("\n");
  
  refresh();
  getch();
  endwin();

  return EXIT_SUCCESS;
}

Solution

  • An array containing the six characters \u4e09 is an array containing six characters, just as an array containing a backslash followed by an n is an array of two characters, not a newline. The compiler converts escape sequences only inside literals. Nothing (except code you write yourself) interprets the contents of a character array.

    So your array wcsarr is not a single wide character. It's a (null-terminated) wide string that uses six wchar_t values to encode six ASCII characters. setcchar requires that its second argument contain only one spacing character (possibly followed by several non-spacing combining characters), and your program does not conform to this specification.
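
    The difference can be checked by comparing lengths. The following is a minimal sketch, not part of the original program; the helper names spelled_out_length and literal_length are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: length of the array built in the question.
   Each element is an ordinary character; nothing decodes the "\u". */
static size_t spelled_out_length(void) {
  const wchar_t wcsarr[] = { L'\\', L'u', L'4', L'e', L'0', L'9', L'\0' };
  assert(wcsarr[0] == L'\\');  /* a literal backslash, not the start of an escape */
  return wcslen(wcsarr);
}

/* Hypothetical helper: length of the literal. Here the compiler has
   already replaced \u4e09 with a single wide character. */
static size_t literal_length(void) {
  return wcslen(L"\u4e09");
}
```

    Calling spelled_out_length() gives 6 and literal_length() gives 1, which is why setcchar only sees the backslash as the spacing character in the first case.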

    You could do something like this:

    wchar_t wcsarr[] = {0, 0};
    wcsarr[0] = L'\u4e09';
    

    If you knew that your locale used Unicode code points as wide character codes, you could write:

    wcsarr[0] = 0x4e09;
    

    since wchar_t, like char, is an integer type. That's occasionally useful if you need to compute a character code (such as non-Latin digits), but normally it's considered better style to use wide character literals.
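
    As a sketch of the "compute a character code" case, the following maps 0..9 to the Arabic-Indic digits U+0660..U+0669. It assumes wide character codes are Unicode code points; an implementation that guarantees this defines the standard macro __STDC_ISO_10646__. The helper names are hypothetical.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: 1 if the implementation promises that wchar_t
   values are ISO 10646 (Unicode) code points, 0 if it makes no promise. */
static int wide_chars_are_unicode(void) {
#ifdef __STDC_ISO_10646__
  return 1;
#else
  return 0;
#endif
}

/* Hypothetical helper: Arabic-Indic digit for d in 0..9.
   ASSUMPTION: wide character codes are Unicode code points. */
static wchar_t arabic_indic_digit(int d) {
  assert(d >= 0 && d <= 9);
  return (wchar_t)(0x0660 + d);
}
```

    The result can be stored in wcsarr[0] exactly like the literal above, after checking wide_chars_are_unicode() if portability matters.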

    If you really need to decode a character string containing an escape sequence, you'll need to verify that the syntax is correct and then use something like strtol (or wcstol when the input is a wide string) with the base parameter set to 16. Note, however, that strtol has no mechanism to restrict its argument to exactly four digits, so if the escape sequence appears in text where it might be followed by what looks like a hexadecimal digit, you will have to extract the four digits first. Either copy them to a temporary buffer, or null-terminate them in place if the character string can be modified. Or you could write your own hexadecimal decoder; it's not hard.
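
    The temporary-buffer approach might look roughly like this for a wide string. It is a sketch under assumptions, not a definitive implementation: decode_u_escape is a hypothetical helper, it uses wcstol (the wide-character counterpart of strtol), and it assumes wide character codes are Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of ncurses or the C library.
   Decodes a "\uXXXX" escape at the start of s. On success it stores the
   value in *out and returns 6 (the number of wide characters consumed);
   it returns 0 if the syntax is wrong.
   ASSUMPTION: wide character codes are Unicode code points. */
static size_t decode_u_escape(const wchar_t *s, wchar_t *out) {
  wchar_t digits[5];  /* four hex digits plus the terminator */
  assert(s != NULL && out != NULL);

  if (s[0] != L'\\' || s[1] != L'u')
    return 0;

  /* Verify and copy exactly four hex digits. iswxdigit(L'\0') is false,
     so the loop never reads past the end of a short string. */
  for (size_t i = 0; i < 4; i++) {
    if (!iswxdigit((wint_t)s[2 + i]))
      return 0;
    digits[i] = s[2 + i];
  }
  digits[4] = L'\0';

  /* The buffer holds only the four digits, so wcstol cannot run on
     into text that follows the escape. */
  *out = (wchar_t)wcstol(digits, NULL, 16);
  return 6;
}
```

    With the array from the question, decode_u_escape(wcsarr, &wc) would leave the code 0x4e09 in wc, which can then be placed in a two-element, null-terminated array and passed to setcchar.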