I am parsing a file that contains characters such as æ, ø, and å. Assume I have stored a line of the text file as follows:
#define MAXLINESIZE 1024
char *buffer = malloc(MAXLINESIZE);
...
fgets(buffer, MAXLINESIZE, handle);
...
Suppose I want to count the number of characters on a line. If I try the following:
char *p = buffer;
int count = 0;
while (*p != '\n') {
    if (isgraph(*p)) {
        count++;
    }
    p++;
}
this ignores any occurrence of æ, ø, and å; i.e. counting "aåeæioøu" returns 5, not 8.
Do I need to read the file in an alternative way? Should I be using something other than a char*, such as an int*?
You need to understand which encoding is used for your characters. It is very probably UTF-8 (and you should use UTF-8 everywhere...); read Joel's blog post on Unicode. If your encoding is not UTF-8, you should convert the text to UTF-8, e.g. using libiconv.
Then you need a C library for UTF-8. There are many of them (but none is standardized in the C11 language yet). I recommend libunistring or GLib (from GTK), but see also this.
Your code will change, since a UTF-8 character can take one to four bytes (the Wikipedia UTF-8 page mentions that the original design allowed up to six bytes; see the Unicode standard for details). You won't test whether a single byte (i.e. a plain C char) is a letter, but whether a byte together with the few bytes after it (reached through a pointer, i.e. a char* or better a uint8_t*) encodes a letter (including Cyrillic letters, etc.).
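To illustrate the multi-byte point, here is a minimal library-free sketch that counts code points rather than bytes. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so counting only the non-continuation bytes counts the characters; it assumes the input is already valid UTF-8 (the function name utf8_count is my own):

```c
#include <stddef.h>
#include <stdint.h>

/* Count UTF-8 code points in a null-terminated string by counting
   only the bytes that start a sequence (i.e. not 10xxxxxx). */
size_t utf8_count(const char *s)
{
    size_t count = 0;
    for (const uint8_t *p = (const uint8_t *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)    /* not a continuation byte */
            count++;
    }
    return count;
}
```

With this, "aåeæioøu" (whose UTF-8 encoding is 11 bytes) counts as 8 characters, which is what the question wanted. Note this counts all code points, not just graphic ones; filtering out whitespace or control characters the way isgraph does would still need a real Unicode library.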
Not every sequence of bytes is a valid UTF-8 representation, so you may want to validate a line (or a null-terminated C string) before analyzing it.
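As a sketch of what validation involves, here is a hand-rolled checker following RFC 3629 (1- to 4-byte sequences, no surrogates, no overlong encodings). In a real program you would more likely call u8_check from libunistring or g_utf8_validate from GLib; the name utf8_valid here is my own:

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true iff s is a well-formed UTF-8 null-terminated string. */
bool utf8_valid(const char *s)
{
    const uint8_t *p = (const uint8_t *)s;
    while (*p) {
        uint8_t b = *p;
        int len;
        uint32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return false;           /* stray continuation or invalid lead */

        for (int i = 1; i < len; i++) {
            if ((p[i] & 0xC0) != 0x80)
                return false;        /* truncated or bad continuation */
            cp = (cp << 6) | (p[i] & 0x3F);
        }

        /* reject overlong encodings, surrogates, and cp > U+10FFFF */
        if ((len == 2 && cp < 0x80)    ||
            (len == 3 && cp < 0x800)   ||
            (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;

        p += len;
    }
    return true;
}
```

Running such a check once per line, before counting or classifying characters, means the rest of your code can safely assume well-formed input.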