I'm trying to implement text support on Windows, with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way, but that doesn't seem easy to accomplish across these two platforms. I have spent a considerable amount of time reading up on Unicode, UTF-8 (and other encodings), wide characters and such, and here is what I have come to understand so far:
Unicode, as the standard, defines the set of characters that can be represented and the code point assigned to each. I refer to this as the "what": Unicode specifies what will be available.
UTF-8 (and other encodings) specify the "how": how each code point is represented in binary.
Windows originally opted for UCS-2, but it cannot represent code points above U+FFFF, so Windows now uses UTF-16, which takes two 16-bit code units (a surrogate pair) when necessary.
So here is the dilemma:
Correct. You will convert UTF-8 to UTF-16 for your Windows API calls.
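On Windows the usual route for this conversion is MultiByteToWideChar with the CP_UTF8 code page. To show what the UTF-16 side of it involves, here is a minimal portable sketch of the surrogate-pair step, starting from already-decoded code points. The helper names append_utf16 and utf32_to_utf16 are made up for illustration, and the sketch assumes every input value is a valid code point (no validation is done).

```cpp
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Code points above U+FFFF are split into a high and a low surrogate.
inline void append_utf16(std::u16string& out, char32_t cp) {
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // top 10 bits
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // bottom 10 bits
    }
}

// Hypothetical helper: convert a whole UTF-32 string to UTF-16.
inline std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        append_utf16(out, cp);
    }
    return out;
}
```

For example, utf32_to_utf16(U"\U0001F600") yields the two code units 0xD83D 0xDE00, while a BMP character such as U+20AC stays a single code unit.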
Most of the time you will use regular string functions for UTF-8 -- strlen, strcpy (ick), snprintf, strtol. They work fine on UTF-8 text, as long as you remember that lengths and offsets are counted in bytes, not characters. Either use char * for UTF-8 or you will have to cast everything.
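One thing to keep in mind with these functions is that they count bytes. A rough sketch of the difference, assuming the input is valid UTF-8; utf8_code_points is a made-up helper name, not a standard function:

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: count code points in a NUL-terminated UTF-8 string
// by skipping continuation bytes (those of the form 10xxxxxx).
// Assumes the input is valid UTF-8.
inline std::size_t utf8_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return n;
}

// "hello" with an e-acute (U+00E9), spelled out as its two UTF-8 bytes.
// The literal is split so the hex escape cannot swallow the next letter.
inline const char* sample_text() {
    return "h\xC3\xA9" "llo";
}
```

Here std::strlen(sample_text()) is 6 because it counts bytes, while utf8_code_points(sample_text()) is 5. Note that even code points are not the same thing as user-visible characters.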
Note that the underscore versions like _mbstowcs are not standard; they are normally named without an underscore, like mbstowcs.
It is difficult to come up with examples where you actually want to use operator[] on a Unicode string; my advice is to stay away from it. Likewise, iterating over a string has surprisingly few uses:
If you are parsing a string (e.g., the string is C or JavaScript code, maybe you want syntax highlighting) then you can do most of the work byte-by-byte and ignore the multibyte aspect, because the bytes of a multibyte UTF-8 sequence never look like ASCII.
If you are doing a search, you will also do this byte-by-byte (but remember to normalize first).
If you are looking for word breaks or grapheme cluster boundaries, you will want to use a library like ICU. The algorithm is not simple.
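To illustrate the byte-by-byte cases above, here is a small sketch using plain std::string. The names split_ascii and find_utf8 are made up for this example, and the search assumes both strings are already in the same normalization form (no normalization is done here).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text on an ASCII delimiter, byte by byte.
// This is safe because every byte of a multibyte UTF-8 sequence is >= 0x80,
// so it can never be mistaken for an ASCII delimiter.
inline std::vector<std::string> split_ascii(const std::string& text, char delim) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = text.find(delim, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));  // last piece
            return parts;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
}

// Hypothetical helper: byte-wise substring search in UTF-8 text.
// Returns the byte offset of the match, or std::string::npos.
// A valid UTF-8 needle can only match at a code point boundary.
inline std::size_t find_utf8(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

The offsets you get back are byte offsets, which is what you want for slicing the original buffer with substr.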
Finally, you can always convert a chunk of text to UTF-32 and work with it that way. I think this is the sanest option if you are implementing any of the Unicode algorithms like collation or breaking.
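A minimal sketch of that conversion, in case it helps. The function name utf8_to_utf32 is made up; malformed input is replaced with U+FFFD, and overlong forms and encoded surrogates are not rejected, so treat it as an illustration rather than a validating decoder.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 into UTF-32, one code point per element.
// Invalid lead bytes and truncated sequences become U+FFFD.
inline std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else {
            // Stray continuation byte or invalid lead byte.
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;  // sequence ended early
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

Once the text is a std::u32string, size() counts code points and operator[] returns a whole code point, which is what the breaking and collation algorithms are specified in terms of.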
See: C++ iterate or split UTF-8 string into array of symbols?