
How to cast to `wint_t` and to `wchar_t`?


Do the standards guarantee that the casts to wint_t and to wchar_t in the following two programs are correct?

#include <locale.h>
#include <wchar.h>
int main(void)
{
  setlocale(LC_CTYPE, "");
  wint_t wc;
  wc = getwchar();
  putwchar((wchar_t) wc);
}

--

#include <locale.h>
#include <wchar.h>
#include <wctype.h>
int main(void)
{
  setlocale(LC_CTYPE, "");
  wchar_t wc;
  wc = L'ÿ';
  if (iswlower((wint_t) wc)) return 0;
  return 1;
}

Consider the case where wchar_t is signed short (this hypothetical implementation is limited to the BMP), wint_t is signed int, and WEOF == ((wint_t)-1). Then (wint_t)U+FFFF is indistinguishable from WEOF. Yes, U+FFFF is a reserved codepoint, but it's still wrong for it to collide.

I would not want to swear that this never happens in real life without an exhaustive audit of existing implementations.

See also: "May wchar_t be promoted to wint_t?"


Solution

  • On the implementation you describe, wchar_t cannot represent the whole BMP: the value of L'\uFEFF' exceeds the range of wchar_t, because the constant is checked against the unsigned type corresponding to wchar_t (C11 6.4.4.4 Character constants, p9). Storing it in a wchar_t defined as signed short, assuming 16-bit shorts, changes its value.

    On the other hand, if the source character set is Unicode and the compiler is configured to decode the source encoding correctly, L'ÿ' has the value 255 in that unsigned type, so the code in the second example is well defined and unambiguous.

    If int is 32 bits wide and short is 16 bits wide, it seems much more consistent to define wchar_t as either int or unsigned short. WEOF can then be defined as (-1), a value that differs from every value of wchar_t, or at least from every value that represents a Unicode code point.