I have a Win32 project in which I'm trying to edit the characters of a WCHAR string with a custom function.
I know this stands for "wide char" and is Unicode; however, I don't fully grasp how the encoding works. For example, I know UTF-8 also holds Unicode, but is it the same as a WCHAR?
I assumed the string would look something like
00 43 00 4f 00 44 00 45 00 00
   C     O     D     E   \0
And for copying, it works fine to just assume the string is twice as long. However, I am getting errors when, say, searching for a character, for example:
for(int i = wcslen(inStr) - 2; i >= 0; i--) {
    WCHAR current[] = {inStr[i], inStr[i + 1], 0, 0};
    if(current == _T("/")) {
        pos = i;
        break;
    }
}
This produces corrupted results. Am I making this too complicated? I understand there are probably many functions to do this, but I'd like to understand how it works so I can write efficient code. Thanks.
The specific problem you’re having is that current[n] is the nth element in the array, not the nth byte of the array. Doing pointer arithmetic like current + n also gives you the nth element after the one current points to. The same is true if you declare an array of int, double, some struct or anything else.
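To make the distinction concrete, here’s a minimal sketch (the string and variable names are only placeholders):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    wchar_t inStr[] = L"CODE";

    /* inStr[1] is the whole second character, L'O', not half of it. */
    wchar_t second = inStr[1];

    /* sizeof counts bytes; wcslen counts wide characters (elements). */
    printf("%zu bytes, %zu characters, second = %lc\n",
           sizeof inStr, wcslen(inStr), (wint_t)second);
    /* On Windows this prints: 10 bytes, 4 characters, second = O */
    return 0;
}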
So, when you declare an array wchar_t a[] = L"!", then take wcslen(a), you get back the count of wide characters in the array, 1. If you try to set i = wcslen(a) - 2; and then take a[i], i will be -1, which is a serious bug.
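A sketch of that failure, not something to ship:

wchar_t a[] = L"!";           /* one character plus the terminator */
int i = (int)wcslen(a) - 2;   /* 1 - 2 == -1 */
/* a[i] would now read one element before the start of the array:
 * undefined behaviour, and a likely source of the corruption you saw. */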
On Windows, WCHAR is an alias for the standard type wchar_t. You don’t say whether you’re writing in C or C++. There are a number of functions in the C standard library to manipulate wide-character strings, in <wchar.h> and <wctype.h>. The C++ standard library has all of these, as well as std::wstring in <string> and wide-character streams including std::wcout, std::wcin and std::wcerr (although Windows doesn’t fully support them). Most Windows API functions can also accept wide-character strings. The standard type of a wide-character string is wchar_t*, but WCHAR*, LPWSTR and, by default on modern versions of Visual Studio, TCHAR* and LPTSTR also work.
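Since the original goal was to find the last slash, here’s a sketch using one of those standard functions, wcsrchr from <wchar.h> (the name last_slash_index is just mine):

#include <stddef.h>
#include <wchar.h>

/* Returns the element index of the last L'/' in a NUL-terminated wide
 * string, or -1 if there is none. */
ptrdiff_t last_slash_index(const wchar_t *inStr)
{
    const wchar_t *p = wcsrchr(inStr, L'/');
    return p ? p - inStr : -1;  /* pointer difference counts elements, not bytes */
}

With L"images/logo.png" this returns 6, the index of the slash itself.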
On Windows, wide characters are little-endian UTF-16. This is not portable, but then, neither is WCHAR. On some other systems, wide characters are either big-endian UTF-16, or big- or little-endian UTF-32. In C, the standard types char16_t and char32_t are defined in <uchar.h>. In C++, they are built into the language. If you try to pass a char16_t* to a function that expects a wchar_t*, it won’t work without a cast, and on targets other than Windows it won’t work at all.
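A quick way to see how the sizes compare (the comments show what you’d typically get when targeting Windows; on Linux, wchar_t is usually 4 bytes):

#include <stdio.h>
#include <uchar.h>
#include <wchar.h>

int main(void)
{
    printf("wchar_t:  %zu bytes\n", sizeof(wchar_t));   /* 2 on Windows, often 4 elsewhere */
    printf("char16_t: %zu bytes\n", sizeof(char16_t));  /* a UTF-16 code unit */
    printf("char32_t: %zu bytes\n", sizeof(char32_t));  /* a UTF-32 code unit */
    return 0;
}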
UTF-8 is a way of storing Unicode code points that’s backwards-compatible with seven-bit ASCII; it is an alternative representation to UTF-16 and UTF-32. A UTF-8 string is stored in an array of unsigned char or char, with one Unicode code point potentially needing several bytes to store it. Actually, because of surrogate pairs, a Unicode code point can need two UTF-16 code units to encode it as well. There are times when it’s convenient to use a different representation (UTF-16LE is what the Windows ABI expects and what some libraries such as ICU and Qt use internally, and UTF-32 is the only representation that guarantees every Unicode character fits into a single element), but my advice is to use UTF-8 whenever you can and some other encoding whenever you have to.
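If you do keep your strings in UTF-8, the usual pattern on Windows is to widen them only at the API boundary. A rough sketch using the Win32 conversion function MultiByteToWideChar (minimal error handling; the helper name widen_utf8 is mine):

#include <stdlib.h>
#include <windows.h>

/* Converts a NUL-terminated UTF-8 string to a newly allocated UTF-16
 * string. Returns NULL on failure; the caller frees the result. */
wchar_t *widen_utf8(const char *utf8)
{
    int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0); /* count includes L'\0' */
    if (n == 0)
        return NULL;
    wchar_t *wide = malloc(n * sizeof *wide);
    if (wide != NULL)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, n);
    return wide;
}

The reverse direction is WideCharToMultiByte with the same CP_UTF8 code page.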
If you want to read backwards through a wide string, you might try this:
int i = wcslen(inStr); // Could be 0.
if (i > 0) { // Don't read one element before the start of the array.
    do {
        --i;
    } while ( i > 0 && inStr[i] != L'/' );
}
/* When we reach this line, i is either 0 or the index of the last slash
 * in inStr, which could also be 0. We can test whether inStr[i] == L'/' or
 * write an if() within our loop to do something more complicated.
 */