c++ unicode locale wchar-t wstring

What is mbstate_t, and why reset it?


Could you please explain to me what exactly mbstate_t is? I have read the cppreference description, but I still don't understand its purpose. What I do understand is that mbstate_t is some static struct visible to a limited set of functions like mbtowc(), wctomb() etc., but I am still confused about how to use it. I can see in the cppreference examples that this struct should be reset before calling some functions. Suppose I want to count the characters in a multi-language string like this one:

std::string str = "Hello! Привет!";

Apparently, str.size() cannot be used in this example, because it simply returns the number of bytes in the string. But something like this does the job:

std::locale::global(std::locale("")); // Linux, UTF-8
std::string str = "Hello! Привет!";
std::string::size_type stringSize = str.size();
std::string::size_type nCharacters = 0;
std::string::size_type nextByte = 0;
int nBytesRead = 0; // mbtowc returns int: -1 on error, 0 at a null character
std::mbtowc(nullptr, 0, 0); // What does it do, and why is it needed?
while (nextByte < stringSize
    && (nBytesRead = std::mbtowc(nullptr, &str[nextByte], stringSize - nextByte))
    > 0) // stop on error instead of storing -1 in an unsigned variable
{
    ++nCharacters;
    nextByte += nBytesRead;
}
std::cout << nCharacters << '\n';

According to the cppreference examples, the mbstate_t struct should be reset before entering the while loop by calling mbtowc() with all arguments set to zero. What is the purpose of this?


Solution

  • The interface to mbtowc is kind of crazy. A historical mistake, I guess.

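    You are not required to pass it a complete string. The multibyte functions are designed so that a string can be converted piece by piece, for example from a buffer (perhaps a network packet) that ends in the middle of a multi-byte character, with the rest arriving in the next call.

    So the conversion has to remember its current (possibly partial) state between calls. mbtowc has no parameter for that, so it keeps the state internally, possibly as a static variable. Strictly speaking, the standard only guarantees that this hidden state tracks shift sequences in stateful encodings: mbtowc reports an incomplete character as an error (-1), and it is mbrtowc that can reliably resume a partial character.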

    A call to std::mbtowc(nullptr, 0, 0); will clear this internal state, so it is ready for a new string.

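    You might want to use mbrtowc instead and pass an explicit mbstate_t object as an extra parameter. mbstate_t is simply the type that holds this conversion state, so with mbrtowc the state is no longer hidden and you decide when it is reset.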

    https://en.cppreference.com/w/cpp/string/multibyte/mbrtowc
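
To make the mbrtowc suggestion concrete, here is a minimal sketch of the character count from the question with an explicit mbstate_t. The helper names (useUtf8Locale, countChunk, countCharacters) are made up for illustration and are not part of any library, and the locale names "C.UTF-8" and "en_US.UTF-8" are assumptions that may not exist on every system.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: tries to select a UTF-8 character type locale.
// The locale names are assumptions; available names differ per platform.
bool useUtf8Locale()
{
    return std::setlocale(LC_CTYPE, "C.UTF-8") != nullptr
        || std::setlocale(LC_CTYPE, "en_US.UTF-8") != nullptr;
}

// Hypothetical helper: counts the complete multibyte characters in one chunk.
// The caller owns `state`, so a character split across two chunks is
// finished by the next call. Returns false on an invalid byte sequence.
bool countChunk(const char* data, std::size_t size,
                std::mbstate_t& state, long& count)
{
    while (size > 0) {
        std::size_t n = std::mbrtowc(nullptr, data, size, &state);
        if (n == static_cast<std::size_t>(-1))
            return false;   // invalid sequence
        if (n == static_cast<std::size_t>(-2))
            return true;    // chunk ended mid-character; state keeps the partial bytes
        if (n == 0)
            n = 1;          // a null character is reported as 0; assume it occupies one byte
        data += n;
        size -= n;
        ++count;
    }
    return true;
}

// Hypothetical helper: counts the characters of a complete string.
// Returns -1 if the string is invalid or ends in the middle of a character.
long countCharacters(const std::string& text)
{
    std::mbstate_t state{};   // zero-initialized means "initial conversion state"
    long count = 0;
    if (!countChunk(text.data(), text.size(), state, count))
        return -1;
    return std::mbsinit(&state) ? count : -1;
}
```

With a UTF-8 locale selected, countCharacters("Hello! Привет!") should give 14, while str.size() reports 20 bytes. Creating a fresh std::mbstate_t for each string plays the role that std::mbtowc(nullptr, 0, 0) plays for the hidden state, and keeping the same object across countChunk calls is what lets a split character be completed.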