c linux unicode locale glibc

Unix: why does reading wide characters in C stop after ASCII?


characters.txt has the content (output from od -c):

0000000   %   (   )   *   +   ,   -   .   /   0   1   2   3   4   5   6
0000020   7   8   9   <   =   >   ?   [   ]  \n   A   B   C   D   E   F
0000040   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V
0000060   W   X   Y   Z  \n   a   b   c   d   e   f   g   h   i   j   k
0000100   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z  \n
0000120 316 223 316 224 316 230 316 233 316 236 316 243 316 246 316 250
0000140 316 251 316 261 316 262 316 263 316 264 316 265 316 266 316 267
0000160 316 270 316 271 316 272 316 273 316 274 316 275 316 276 316 277
0000200 317 200 317 201 317 202 317 203 317 204 317 205 317 206 317 207
0000220 317 210 317 211  \n

That is, some ASCII followed by some Greek in UTF-8. I want to read these characters with the following function, which is modeled on examples in the glibc info pages:

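#include <stdio.h>
#include <wchar.h>

wint_t *read_characters(void) {
    const char *filename = "characters.txt";
    static wint_t b[16384];
    wint_t c, *p = b;
    FILE *infile = fopen(filename, "rb");
    if (infile == NULL)
        return NULL;
    printf("File orientation: %d\n", fwide(infile, 0));
    /* Leave room for the terminating WEOF; count elements, not bytes. */
    while ((size_t)(p - b) < sizeof b / sizeof b[0] - 1 && (c = fgetwc(infile)) != WEOF)
        *p++ = c;
    *p++ = WEOF;
    printf("\nRead %td wint_t chars from characters.txt\n", p - b);
    fclose(infile);
    return b;
}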

The output is:

File orientation: 0

Read 81 wint_t chars from characters.txt

The count of 81 is the 80 ASCII characters plus the terminating WEOF, so reading stopped at the first Greek character. Why? I'm not using any signed variable that could be mistaken for WEOF. Who can help?


Solution

  • The solution (hinted at by n.m.) was to add this call (declared in <locale.h>) before reading the file:

    setlocale(LC_ALL, "en_US.UTF-8");
    

    This is necessary even if LC_ALL is set in the environment, because C programs always start out in the "C" locale. To use any other locale, the program has to select it itself with setlocale.
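
    As a rough sketch of how this fits together (the helper names select_locale and count_wide_chars are made up for illustration, they are not from the question or from glibc): fgetwc returns WEOF both at end-of-file and when a byte sequence cannot be decoded in the current locale, so checking ferror() afterwards shows which of the two happened. Passing "" to setlocale selects whatever locale the environment names, which avoids hard-coding en_US.UTF-8. The sketch assumes the locale is selected before the file is opened and read.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: leave the startup "C" locale.
   Pass "" to use the locale named by the environment (LANG, LC_ALL, ...),
   or an explicit name such as "en_US.UTF-8".
   Returns 1 on success, 0 if the requested locale is not available. */
int select_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}

/* Hypothetical helper: count the wide characters fgetwc can decode from
   `path` under the current locale. Returns the count, or -1 if the file
   cannot be opened. *decode_error is set to 1 if reading stopped because
   of an error (for example an undecodable byte) rather than end-of-file. */
long count_wide_chars(const char *path, int *decode_error)
{
    FILE *in = fopen(path, "rb");
    long n = 0;

    *decode_error = 0;
    if (in == NULL)
        return -1;

    while (fgetwc(in) != WEOF)
        n++;

    /* WEOF alone is ambiguous; ferror() separates failure from end-of-file. */
    if (ferror(in))
        *decode_error = 1;

    fclose(in);
    return n;
}
```

    A caller would run select_locale("") (or select_locale("en_US.UTF-8")) once at startup and then call count_wide_chars("characters.txt", &err). If err comes back as 1, the loop ended on a read error, which is presumably what happened in the question when the first Greek byte was read in the "C" locale.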