c++ templates stl locale

What is std::mbstate_t?


I'm creating a custom locale by deriving from std::codecvt. Most of the methods I'm supposed to implement are pretty straightforward, except for the std::mbstate_t parameter. On my compiler, VS2010, it's declared as an int. But Google tells me it's a POD type that is sometimes a union (of what, I don't know) or a struct (whose definition, again, I can't find).

As I understand it, std::mbstate_t is a placeholder for partial conversions. I think it comes into play when std::codecvt::do_out() requires more space to write the output, which in turn will call std::codecvt::do_unshift(). Please correct me if my assumptions are wrong.

I've read another post about storing pointers, though the post doesn't have an adequate answer. I've also read this example, which presumes it to be a 32-bit type, although the standard only guarantees that an int is at least 16 bits.

My question: what can I safely store in std::mbstate_t? Can I safely replace it with another type? The answer to the above post suggests replacing it, but the comment that follows it says otherwise.


Solution

  • I think that /the/ book on these things is Standard C++ IOStreams and Locales by Langer and Kreft; if you seriously want to mess with these things, try to get hold of a copy. Now, coming back to your question: mbstate_t is used to hold the state of the conversion. Normally, you would store this inside the conversion facet, but since facets are immutable, you need to store it externally. In practice, the state matters when the bytes at hand are not enough on their own to determine the corresponding character, because earlier input changes how later bytes are interpreted; the Linux man page for mbsinit() gives ISO-2022 and UTF-7 as examples of such encodings. Note that this does not affect UTF-8, where a single Unicode code point is always encoded by a self-contained sequence of bytes, and nothing before or after that sequence affects the result. Partial UTF-8 sequences are not handled through the state either; do_in() returns partial instead.

    Now, what can you store in the mbstate_t? Since the actual type is unspecified and the number of functions that manipulate it is very limited, there is nothing you can portably do with it at first. However, nothing else does anything with that state either, so you can do some ugly hacking on it. This might require a few #ifdefs depending on the standard library, but then you can simply (ab)use the fact that it's a POD (ints and unions are also PODs) to store pretty much any POD type that is not larger. This won't win you a beauty prize, and the code won't automatically work on every system, but I think in this case it's unavoidable, and the porting work is limited.

    Finally, can you replace it? This type is part of std::char_traits (as state_type), which in turn affects practically all strings and streams, so you would need to replace them throughout your program or convert between the two. Further, even if you create a new char_traits class, you still can't easily instantiate e.g. basic_string with it, because there is no guarantee that a general basic_string template even exists; it is only required that the two specializations for char and wchar_t (and some more in C++11) exist. Ditto for streams. In short: no, you can't replace mbstate_t.
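
To make the "store a small POD in it" hack concrete, here is a minimal sketch. ShiftState, save_state and load_state are hypothetical names made up for illustration; they are not part of any standard library. The sketch assumes your state fits into mbstate_t on the target platform, which the static_assert checks at compile time rather than taking on faith.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>
#include <type_traits>

// Hypothetical conversion state for a custom codecvt facet.
// The fields are only examples of what a stateful encoding might track.
struct ShiftState {
    unsigned char mode;     // e.g. which character set is currently selected
    unsigned char pending;  // e.g. one buffered byte of an incomplete sequence
};

// Refuse to compile where the hack cannot work, instead of
// silently overrunning the mbstate_t object.
static_assert(std::is_trivially_copyable<ShiftState>::value,
              "state must be copyable with memcpy");
static_assert(sizeof(ShiftState) <= sizeof(std::mbstate_t),
              "state does not fit into mbstate_t on this platform");

// Copy our state into the opaque mbstate_t handed to do_in()/do_out().
inline void save_state(std::mbstate_t& mb, const ShiftState& s) {
    std::memset(&mb, 0, sizeof mb);   // clear the bytes we do not use
    std::memcpy(&mb, &s, sizeof s);   // then overwrite the leading bytes
}

// Read our state back out of the mbstate_t.
inline ShiftState load_state(const std::mbstate_t& mb) {
    ShiftState s;
    std::memcpy(&s, &mb, sizeof s);
    return s;
}
```

Inside do_in() or do_out() you would call load_state() on entry and save_state() before returning. Using memcpy rather than a cast avoids aliasing problems, and keeping the all-zero ShiftState as your "initial" state is a sensible choice, because a zero-initialized mbstate_t is what callers pass in to start a fresh conversion.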