I'm confused about how C handles character encodings. I have a C source file, test1.c, saved in ISO 8859-1. When I run the program, the character ÿ is not displayed correctly on the Linux console. I know the console uses UTF-8 by default, but since the first 256 Unicode code points match ISO 8859-1, why doesn't the program display 'ÿ' correctly? A second question: why does test2 display 'ÿ' correctly, given that test2.c and file.txt are both UTF-8? In other words, shouldn't the compiler complain about 'ÿ' being a multi-character constant?
test1.c

// ISO 8859-1
#include <stdio.h>

int main(void)
{
    unsigned char c = 'ÿ';
    putchar(c);
    return 0;
}

$ gcc -o test1 test1.c
$ ./test1
$ ▒
test2.c

// ASCII
#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("file.txt", "r+");
    int c;
    while ((c = fgetc(fp)) != EOF)
        putchar(c);
    return 0;
}

file.txt (UTF-8): abcdefÿghi

$ gcc -o test2 test2.c
$ ./test2
$ abcdefÿghi
Well, that's it. If you can help me with details about this, I would be very grateful. :)
Character encodings can be confusing for many reasons. Here are some explanations:
In the ISO 8859-1 encoding, the character ÿ (y with diaeresis, originally a ligature of i and j) is encoded as the byte value 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as in ISO 8859-1, but the popular UTF-8 encoding for Unicode uses 2 bytes for code points larger than 127, so ÿ is encoded in UTF-8 as 0xC3 0xBF.
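To see this concretely, here is a minimal sketch that prints the bytes of the string "ÿ"; it assumes the source file itself is saved as UTF-8:

#include <stdio.h>

int main(void)
{
    /* With a UTF-8 source file, "ÿ" holds the two bytes 0xC3 0xBF
       followed by the terminating '\0'. */
    const char *s = "ÿ";
    for (const char *p = s; *p != '\0'; p++)
        printf("0x%02X ", (unsigned char)*p);
    putchar('\n');    /* prints: 0xC3 0xBF */
    return 0;
}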
When your program reads file.txt, it reads one byte at a time and outputs each byte to the console unchanged (except for line endings on legacy systems). The ÿ is read as 2 separate bytes which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.
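A quick way to confirm this is to dump the byte values instead of passing them through. A sketch, assuming the same file.txt from the question is present:

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("file.txt", "r");
    if (fp == NULL)
        return 1;
    int c;
    while ((c = fgetc(fp)) != EOF)
        printf("0x%02X ", c);   /* ÿ appears as 0xC3 0xBF */
    putchar('\n');
    fclose(fp);
    return 0;
}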
Adding to the confusion, if the source file uses UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multibyte character constant. Multibyte character constants are very confusing and non-portable (the value can be 0xC3BF or 0xBFC3 depending on the system); using them is strongly discouraged, and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).
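A sketch of what this looks like in practice; it assumes a UTF-8 source file and a compiler that, like gcc, evaluates the constant to 0xC3BF (the value is implementation-defined):

#include <stdio.h>

int main(void)
{
    /* 'ÿ' in a UTF-8 source file is a multibyte character constant;
       gcc warns about it (-Wmultichar is on by default). */
    printf("'ÿ' = 0x%X\n", (unsigned)'ÿ');        /* typically 0xC3BF */
    printf("length of \"ÿ\" = %zu\n", sizeof "ÿ" - 1);  /* 2 bytes */
    return 0;
}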
Even more confusing is this: on many systems the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has a value of -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have a value of -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.
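A sketch illustrating this; it assumes a platform where char is signed (most x86 Linux systems) and uses '\xff' in place of an ISO 8859-1 'ÿ' so the source stays plain ASCII:

#include <stdio.h>

int main(void)
{
    /* '\xff' is what 'ÿ' would be in an ISO 8859-1 source file. */
    printf("%d\n", '\xff');       /* -1 where char is signed */
    printf("%d\n", "\xff"[0]);    /* -1 as well: a signed char, promoted to int */
    printf("%d\n", EOF);          /* usually -1 too */
    return 0;
}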
On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX, or the special negative value EOF. So the byte 0xFF from a file where the character ÿ is encoded as ISO 8859-1 is returned as the value 0xFF, or 255, which compares unequal to 'ÿ' if char is signed, and also unequal to 'ÿ' if the source code is in UTF-8.
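Here is a sketch of the resulting pitfall, assuming char is signed; the fix is to compare against (unsigned char)'\xff' or simply the value 0xFF:

#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        if (c == '\xff')                  /* never true: 255 != -1 */
            puts("matched (signed char)");
        if (c == (unsigned char)'\xff')   /* correct: 255 == 255 */
            puts("matched 0xFF");
    }
    return 0;
}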
As a rule of thumb: do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents, and configure the compiler to make char unsigned by default (-funsigned-char).
If you deal with foreign languages, using UTF-8 is highly recommended for all textual content, including source code. Be aware that non-ASCII characters are encoded as multiple bytes in this encoding. Study the UTF-8 encoding; it is quite simple and elegant. Use libraries to handle textual transformations such as uppercasing.
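To give a flavor of that simplicity, here is a minimal sketch that decodes a two-byte UTF-8 sequence into its code point; the bit layout 110xxxxx 10xxxxxx comes from the UTF-8 specification, and the hard-coded bytes are those of ÿ:

#include <stdio.h>

int main(void)
{
    /* Two-byte UTF-8 sequences have the form 110xxxxx 10xxxxxx;
       the payload bits, concatenated, give the code point. */
    unsigned char b1 = 0xC3, b2 = 0xBF;               /* the bytes of ÿ */
    unsigned cp = ((b1 & 0x1Fu) << 6) | (b2 & 0x3Fu);
    printf("U+%04X\n", cp);                           /* prints U+00FF */
    return 0;
}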