I'm trying to use libunistring in my C program.
I have to process UTF-8 strings, and for that I used the u8_strlen() function from the libunistring library.
Code example:
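#include <stdio.h>
#include <string.h>
#include <unistr.h>
/* Both functions return size_t, so the format specifier must be %zu, not %d. */
void print_length(uint8_t *msg) {
    printf("Default strlen: %zu\n", strlen((char *)msg));
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}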
Suppose we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoding).
I expected strlen() to return 12 (6 letters * 2 bytes per letter) and u8_strlen() to return 6 (just 6 letters).
But I received curious results:
Default strlen: 12
U8 strlen: 12
After this I tried to look up the u8_strlen implementation and found this code:
size_t
u8_strlen (const uint8_t *s)
{
  return strlen ((const char *) s);
}
I'm wondering: is this a bug, or is it the correct answer? If it's correct, why?
I believe this is the intended behavior.
The libunistring manual says that:
size_t u8_strlen (const uint8_t *s)
Returns the number of units in s.
The manual also defines what a "unit" is:
UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).
I believe the reason the function is named u8_strlen, even though it does nothing more than the standard strlen, is that the library also has u16_strlen and u32_strlen for UTF-16 and UTF-32 strings, respectively (these would count the 2-byte units before a terminating 0x0000 and the 4-byte units before a terminating 0x00000000), and u8_strlen was included simply for completeness.
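To make the "units" idea concrete, here is a minimal sketch that counts units for the same word in UTF-8 and in UTF-16. The helpers count_units8 and count_units16 are hypothetical stand-ins written for illustration, not libunistring's actual code, and the word is spelled out as explicit bytes and code units so the result does not depend on the source file's encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for u8_strlen: counts 8-bit units (bytes)
   before the terminating 0x00. */
size_t count_units8(const uint8_t *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* Hypothetical stand-in for u16_strlen: counts 16-bit units
   before the terminating 0x0000. */
size_t count_units16(const uint16_t *s) {
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

/* "привет" in UTF-8: each Cyrillic letter takes two bytes. */
const uint8_t privet_utf8[] = {
    0xD0, 0xBF, 0xD1, 0x80, 0xD0, 0xB8,
    0xD0, 0xB2, 0xD0, 0xB5, 0xD1, 0x82, 0x00
};

/* "привет" in UTF-16: each letter fits in a single 16-bit unit. */
const uint16_t privet_utf16[] = {
    0x043F, 0x0440, 0x0438, 0x0432, 0x0435, 0x0442, 0x0000
};
```

With these definitions, count_units8(privet_utf8) gives 12 and count_units16(privet_utf16) gives 6. Both answers count storage units, not characters. The UTF-16 count only matches the letter count because every letter here fits in one 16-bit unit.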
GNU gnulib does, however, include mbslen, which probably does what you want:
mbslen function: Determine the number of multibyte characters in a string.
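If pulling in gnulib is not an option, counting characters in UTF-8 by hand is short. The sketch below uses a hypothetical helper, utf8_count_codepoints (not part of libunistring or gnulib), and assumes the input is valid, NUL-terminated UTF-8. It counts every byte that is not a continuation byte (10xxxxxx), which is the same as counting the lead bytes of each encoded code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts Unicode code points in a NUL-terminated
   string, assuming the input is valid UTF-8 (no validation is done). */
size_t utf8_count_codepoints(const uint8_t *s) {
    assert(s != NULL);
    size_t count = 0;
    for (; *s != 0; s++) {
        /* Continuation bytes have the bit pattern 10xxxxxx; every other
           byte starts a new code point. */
        if ((*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the 12-byte UTF-8 encoding of "привет" this returns 6. Note that it counts code points, so a letter followed by a combining accent would count as two. I believe libunistring itself also offers u8_mbsnlen for counting characters in a given number of units, but check the manual for the exact semantics before relying on it.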