c unicode utf-8

Is libunistring's u8_strlen() equivalent to strlen()?


I'm trying to use libunistring in my C program. I have to process UTF-8 strings, so I used the u8_strlen() function from the libunistring library.
Code example:

#include <stdio.h>
#include <string.h>
#include <unistr.h>

void print_length(const uint8_t *msg) {
    printf("Default strlen: %zu\n", strlen((const char *)msg));
    printf("U8 strlen: %zu\n", u8_strlen(msg));
}

Suppose we call print_length() with msg = "привет" (Cyrillic, UTF-8 encoded). I expected strlen() to return 12 (6 letters * 2 bytes per letter) and u8_strlen() to return 6 (just the 6 letters).

But I received curious results:

Default strlen: 12
U8 strlen: 12

After this I looked up the implementation of u8_strlen and found this code:

size_t
u8_strlen (const uint8_t *s)
{
    return strlen ((const char *) s);
}

Is this a bug, or is it the correct result? If it is correct, why?


Solution

  • I believe this is the intended behavior.

    The libunistring manual says that:

    size_t u8_strlen (const uint8_t *s)

    Returns the number of units in s.

    Also in the manual, it defines what this "unit" is:

    UTF-8 strings, through the type ‘uint8_t *’. The units are bytes (uint8_t).

    I believe the function is named u8_strlen, even though it does nothing more than the standard strlen, because the library also provides u16_strlen and u32_strlen for UTF-16 and UTF-32 strings. Those count 2-byte units up to a 0x0000 terminator and 4-byte units up to a 0x00000000 terminator, respectively, so u8_strlen was presumably included simply for completeness.
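
    To make the "unit" idea concrete, here is a minimal sketch of what counting units means for each encoding. The count_units8, count_units16 and count_units32 helpers are hypothetical names written for this illustration; they are not libunistring's source, only an approximation of the documented behavior (count units up to the first zero unit).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers (not libunistring code): each one counts code
   units up to, but not including, the first zero-valued unit. */

/* UTF-8: one unit is one byte. */
static size_t count_units8(const uint8_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* UTF-16: one unit is 16 bits wide; the terminator is 0x0000. */
static size_t count_units16(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* UTF-32: one unit is 32 bits wide; the terminator is 0x00000000. */
static size_t count_units32(const uint32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

    For "привет", the UTF-8 form has 12 units, while the UTF-16 and UTF-32 forms have 6 units each, because every one of these letters fits in a single 16-bit or 32-bit unit. None of the three counts is a character count in general; it only happens to match for UTF-16 and UTF-32 here.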

    GNU gnulib does, however, include mbslen, which probably does what you want:

    mbslen function: Determine the number of multibyte characters in a string.
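
    If pulling in gnulib is not an option, counting characters (code points) in a UTF-8 string can also be sketched by hand. The utf8_count_codepoints helper below is a hypothetical name, not part of libunistring or gnulib, and it assumes the input is valid, NUL-terminated UTF-8: it counts every byte that is not a continuation byte (bit pattern 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a NUL-terminated string.
   Assumption: the input is valid UTF-8; malformed input is not detected. */
static size_t utf8_count_codepoints(const uint8_t *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != 0; s++) {
        /* Continuation bytes look like 10xxxxxx; every other byte
           starts a new code point. */
        if ((*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

    Called on the UTF-8 bytes of "привет", this returns 6, where u8_strlen returns 12. Note that it counts code points, not user-perceived characters, so a letter followed by a combining mark counts as two. The libunistring manual also documents u8_mbsnlen(s, n), which counts the Unicode characters in the first n units of s and may be a better fit than a hand-written loop.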