I am building a string library to support both ASCII and UTF-8. I created two typedefs, t_ascii and t_utf8. ASCII is safe to read as UTF-8, but UTF-8 is not safe to read as ASCII.
Is there any way to issue a warning when implicitly converting from t_utf8 to t_ascii, but not when implicitly converting from t_ascii to t_utf8?
Ideally, I would want these warnings (and only these warnings) to be issued:
#include <stdint.h>
typedef char t_ascii;
typedef uint_least8_t t_utf8;
int main(void)
{
    t_ascii const* asciistr = "Hello world"; // Ok
    t_utf8 const* utf8str = "你好世界"; // Ok
    asciistr = utf8str; // Warning: utf8 to ascii is not safe
    utf8str = asciistr; // Ok: ascii to utf8 is safe
    t_ascii asciichar = 'A';
    t_utf8 utf8char = 'B';
    asciichar = utf8char; // Warning: utf8 to ascii is not safe
    utf8char = asciichar; // Ok: ascii to utf8 is safe
}
Currently, when building with -Wall (and even with -funsigned-char), I get these warnings:
gcc main.c -Wall -Wextra
main.c: In function ‘main’:
main.c:10:35: warning: pointer targets in initialization of ‘const t_utf8 *’ {aka ‘const unsigned char *’} from ‘char *’ differ in signedness [-Wpointer-sign]
10 | t_utf8 const* utf8str = "你好世界"; // Ok
| ^~~~~~~~~~
main.c:12:18: warning: pointer targets in assignment from ‘const t_utf8 *’ {aka ‘const unsigned char *’} to ‘const t_ascii *’ {aka ‘const char *’} differ in signedness [-Wpointer-sign]
12 | asciistr = utf8str; // Warning: utf8 to ascii is not safe
| ^
main.c:16:17: warning: pointer targets in assignment from ‘const t_ascii *’ {aka ‘const char *’} to ‘const t_utf8 *’ {aka ‘const unsigned char *’} differ in signedness [-Wpointer-sign]
16 | utf8str = asciistr; // Ok: ascii to utf8 is safe
| ^
Compile with -Wall. Always compile with -Wall.
<user>@squall:~/src/p1$ gcc -Wall -c test2.c
test2.c: In function ‘main’:
test2.c:9:31: warning: pointer targets in initialization of ‘const t_utf8 *’ {aka ‘const signed char *’} from ‘char *’ differ in signedness [-Wpointer-sign]
9 | t_utf8 const* utf8str = "你好世界";
| ^~~~~~~~~~~~~~
test2.c:11:13: warning: pointer targets in assignment from ‘const t_ascii *’ {aka ‘const char *’} to ‘const t_utf8 *’ {aka ‘const signed char *’} differ in signedness [-Wpointer-sign]
11 | utf8str = asciistr; // Ok: ascii to utf8 is safe
| ^
test2.c:12:14: warning: pointer targets in assignment from ‘const t_utf8 *’ {aka ‘const signed char *’} to ‘const t_ascii *’ {aka ‘const char *’} differ in signedness [-Wpointer-sign]
12 | asciistr = utf8str; // Should issue warning: utf8 to ascii is not safe
| ^
You want it to be safe to convert from t_ascii to t_utf8, but it's simply not. The signedness differs.
The warning is not about the fact that valid utf8 is sometimes not valid ASCII - the compiler knows nothing about that. The warning is about the sign.
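To illustrate that point, here is a minimal sketch showing what the compiler actually sees behind each typedef. It assumes t_utf8 resolves to unsigned char, as in the question's GCC output; the macro CHAR_TYPE_NAME and the two helper functions are hypothetical names made up for this example.

```c
#include <assert.h>
#include <string.h>

typedef char t_ascii;
typedef unsigned char t_utf8; /* assumption: what uint_least8_t resolved to in the question */

/* Map an expression to the name of its underlying character type (C11 _Generic). */
#define CHAR_TYPE_NAME(x) _Generic((x), \
    char: "char", \
    signed char: "signed char", \
    unsigned char: "unsigned char", \
    default: "other")

/* Hypothetical helpers: report the type the compiler sees behind each typedef. */
static const char *ascii_type_name(void)
{
    t_ascii c = 'A';
    return CHAR_TYPE_NAME(c);
}

static const char *utf8_type_name(void)
{
    t_utf8 c = 'B';
    return CHAR_TYPE_NAME(c);
}
```

A typedef is only an alias, so the compiler has nothing but the underlying character types to compare. The names t_ascii and t_utf8 carry no encoding information that a diagnostic could be based on.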
If you want an unsigned char, compile with -funsigned-char. But then neither warning will be issued.
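Since compiler flags treat both directions the same way, one possible route to asymmetric checking is to stop relying on typedefs of character types and wrap each pointer in its own struct. This is only a sketch under that assumption, not something the compiler offers out of the box: the names ascii_str, utf8_str, ascii_to_utf8 and demo_roundtrip are hypothetical, and a mismatched assignment becomes a hard error rather than a warning.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical wrapper types: distinct structs are never implicitly convertible. */
typedef struct { const char *p; } ascii_str;
typedef struct { const unsigned char *p; } utf8_str;

/* The safe direction (ASCII to UTF-8) is spelled out as an explicit function call. */
static utf8_str ascii_to_utf8(ascii_str s)
{
    utf8_str u = { (const unsigned char *)s.p };
    return u;
}

/* Hypothetical demo: returns 1 if the converted string holds the same bytes. */
static int demo_roundtrip(void)
{
    ascii_str a = { "Hello world" };
    utf8_str u = ascii_to_utf8(a);  /* OK: explicit, safe direction */
    /* a = u; */                    /* would not compile: incompatible struct types */
    return memcmp(a.p, u.p, strlen(a.p) + 1) == 0;
}
```

The trade-off is that every string has to pass through the wrapper types, and no conversion function is provided for the unsafe direction, so UTF-8 to ASCII cannot happen by accident.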
(By the way, if you think that type int_least8_t will be able to hold a multibyte char / complete UTF-8 code point encoding - it will not. All int_least8_t objects, and consequently all t_utf8 objects, in a single compilation unit will have the exact same size.)
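As a small illustration of that size point, the sketch below stores the UTF-8 encoding of 你 (U+4F60) as explicit bytes and counts the array elements it occupies. It assumes t_utf8 is unsigned char; the array ni_hao_first and the function utf8_byte_len are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char t_utf8; /* assumption: one byte per element */

/* UTF-8 encoding of U+4F60, written as raw bytes to avoid depending
   on the source file's encoding. */
static const t_utf8 ni_hao_first[] = { 0xE4, 0xBD, 0xA0, 0x00 };

/* Count the t_utf8 elements before the terminating zero. */
static size_t utf8_byte_len(const t_utf8 *s)
{
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

A single code point here spans three t_utf8 elements, so a lone t_utf8 variable can hold at most one byte of such a character.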