Let's imagine I have a UTF-8 encoded std::string containing the following:
óó
and I'd like to convert it to the following:
ÓÓ
Ideally I want the uppercase/lowercase approach I'm using to be generic across all of UTF-8. If that's even possible.
The original byte sequence in the string is 0xc3b3c3b3 (two bytes per character, and two instances of ó) and I'd like the output to be 0xc393c393 (two instances of Ó). There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8. It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.
I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string. Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.
I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!
There is no standard way to do Unicode case conversion in C++. There are ways that work on some C++ implementations, but the standard doesn't require them to.
If you want guaranteed Unicode case conversion, you will need to use a library like ICU or Boost.Locale (aka: ICU with a more C++-like interface).
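To see why toupper() hands back 0xc3b3 unchanged, and roughly what a library has to do instead, here is a minimal sketch. It is not a general solution: the helper names (byte_wise_toupper, to_upper_latin1_sketch) are made up for illustration, the input is assumed to be valid UTF-8, and only ASCII plus the two-byte Latin-1 Supplement letters U+00E0..U+00FE are handled. Real Unicode case mapping needs the full tables, locale-specific rules, and mappings that change the string length, which is what ICU provides.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// What the question effectively tried: std::toupper on each byte.
// In the default "C" locale only 'a'..'z' are mapped, so the bytes
// 0xc3 and 0xb3 come back unchanged.
std::string byte_wise_toupper(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        out += static_cast<char>(std::toupper(b));
    }
    return out;
}

// Hypothetical helper: decode, map, re-encode. Handles ASCII and the
// two-byte range U+00E0..U+00FE (except U+00F7, the division sign).
// Every other byte is copied through untouched.
std::string to_upper_latin1_sketch(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            // One-byte (ASCII) code point.
            const bool lower = (b0 >= 'a' && b0 <= 'z');
            out += static_cast<char>(lower ? b0 - 0x20 : b0);
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < in.size()) {
            // Two-byte sequence: 110xxxxx 10yyyyyy -> code point xxxxxyyyyyy.
            const unsigned char b1 = static_cast<unsigned char>(in[i + 1]);
            std::uint32_t cp = (static_cast<std::uint32_t>(b0 & 0x1F) << 6) | (b1 & 0x3F);
            if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) {
                cp -= 0x20;  // e.g. U+00F3 (ó) -> U+00D3 (Ó)
            }
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
            i += 2;
        } else {
            // Longer sequences and stray bytes are outside this sketch.
            out += in[i];
            i += 1;
        }
    }
    return out;
}
```

With the input from the question, byte_wise_toupper("\xc3\xb3\xc3\xb3") returns the same four bytes, while to_upper_latin1_sketch returns 0xc393c393. The mapping has to happen on decoded code points, not on individual bytes, which is why a per-char toupper() cannot do the job.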