
How to do string operations with Win32 WCHAR

I have a win32 project in which I'm trying to edit the characters of a WCHAR string with a custom function.

I know this stands for Wide Char and is Unicode, however I don't fully grasp how the encoding works. For example, I know UTF-8 also holds Unicode, but is it the same as a WCHAR?

I assumed the string would look something like

00 43 00 4f 00 44 00 45 00 00
    C     O     D     E    \0

And for copying it works fine to just assume the string is twice as long. However I am getting errors when, say, searching for a character, for example:

for(int i = wcslen(inStr) - 2; i >= 0; i--) {
    WCHAR current[] = {inStr[i], inStr[i + 1], 0, 0};
    if(current == _T("/")) {
        pos = i;

This produces what look like corruption errors. Am I making this too complicated? I understand there are probably many functions to do this, but I'd like to understand how it works, so I can write efficient code. Thanks


  • Shorter Answer

    The specific problem you’re having is that current[n] is the nth element in the array, not the nth byte of the array. Doing pointer arithmetic like current + n also gives you the nth element after the one current points to. The same is true if you declare an array of int, double, some struct or anything else.

    So, when you declare an array wchar_t a[] = L"!", then take wcslen(a), you get back the count of wide characters in the array, 1. If you try to set i = wcslen(a) - 2; and then take a[i], i will be -1, so a[i] reads before the start of the array, which is a serious bug.
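    To make the element-versus-byte distinction concrete, here is a minimal sketch in C. The helper names wide_string_bytes and is_slash_at are hypothetical, chosen only for illustration. It also hints at why a test like current == _T("/") cannot work: that expression compares the addresses of two arrays, whereas comparing one element against a wide character literal compares values.

```c
#include <stddef.h>
#include <wchar.h>

/* Bytes occupied by a wide string, including its terminating L'\0'.
 * A byte count only matters for byte-oriented calls such as malloc or
 * memcpy; indexing and wcslen always work in whole elements. */
size_t wide_string_bytes(const wchar_t *s)
{
    return (wcslen(s) + 1) * sizeof(wchar_t);
}

/* Nonzero if element i of s is a forward slash.
 * s[i] is already one complete wide character, so it can be compared
 * directly with a wide character literal.
 * Assumption: the caller guarantees i <= wcslen(s). */
int is_slash_at(const wchar_t *s, size_t i)
{
    return s[i] == L'/';
}
```

    With these, is_slash_at(inStr, i) inspects exactly one character, and there is no need to build a temporary two-element array.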

    Longer Explanation

    On Windows, WCHAR is an alias for the standard type wchar_t. You don’t say whether you’re writing in C or C++. There are a number of functions in the C standard library to manipulate wide-character strings, in <wchar.h> and <wctype.h>. The C++ standard library has all of these, as well as std::wstring in <string> and wide-character streams including std::wcout, std::wcin and std::wcerr (although Windows doesn’t fully support them). Most Windows API functions also can accept wide-character strings. The standard type of a wide character string is wchar_t*, but WCHAR*, LPWSTR and, by default on modern versions of Visual Studio, TCHAR* and LPTSTR also work.
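    As one example of those library functions, wcsrchr from <wchar.h> searches a wide string for the last occurrence of a character. The sketch below wraps it to return an index; last_slash_index is a hypothetical helper name, and returning -1 for "not found" is a convention chosen here, not something the library prescribes.

```c
#include <stddef.h>
#include <wchar.h>

/* Index of the last L'/' in s, or -1 if s contains no slash.
 * wcsrchr returns a pointer to the matching element (or NULL), so
 * subtracting the start of the string yields an element index, not a
 * byte offset. */
ptrdiff_t last_slash_index(const wchar_t *s)
{
    const wchar_t *p = wcsrchr(s, L'/');
    return p ? p - s : -1;
}
```

    On Windows the same call accepts a WCHAR* or LPWSTR unchanged, since those are aliases for wchar_t*.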

    On Windows, wide characters are little-endian UTF-16. This is not portable, but then, neither is WCHAR. On some other systems, wide characters are either big-endian UTF-16, or big- or little-endian UTF-32. In C, the standard types char16_t and char32_t are defined in <uchar.h>. In C++, they are built into the language. If you try to pass a char16_t* to a function that expects a wchar_t*, it won’t work without a cast, and on targets other than Windows it might not work at all.

    UTF-8 is a way of storing Unicode code points that’s backwards-compatible with seven-bit ASCII. UTF-8 is an alternative representation to UTF-16 or UTF-32. A UTF-8 string will be stored in an array of unsigned char or char, with one Unicode code point potentially needing several bytes to store it. Actually, because of surrogate pairs, a Unicode code point potentially needs two UTF-16 objects to encode it, as well. There are some times when it’s convenient to use a different representation (UTF-16LE is what the Windows ABI expects and what some libraries like ICU and Qt use internally, and UTF-32 is the only representation that guarantees all Unicode characters will fit into a single element), but my advice is to use UTF-8 whenever you can and some other encoding whenever you have to.
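    To illustrate how many elements each encoding needs, here is a rough sketch that encodes a single code point as UTF-16 and reports how many bytes UTF-8 would need for it. The function names encode_utf16 and utf8_length are hypothetical, and uint16_t stands in for a UTF-16 code unit so the example does not depend on the width of wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if cp is a code point that can be encoded: at most U+10FFFF
 * and not in the surrogate range U+D800..U+DFFF. */
static int is_scalar_value(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Encode cp as UTF-16. Writes one or two code units to out and returns
 * how many were written, or 0 if cp cannot be encoded. */
size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    if (!is_scalar_value(cp))
        return 0;
    if (cp < 0x10000) {          /* Fits in a single code unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;               /* 20 bits remain, split 10 and 10. */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

/* Number of bytes UTF-8 needs for cp, or 0 if cp cannot be encoded. */
size_t utf8_length(uint32_t cp)
{
    if (!is_scalar_value(cp))
        return 0;
    if (cp < 0x80)    return 1;  /* ASCII range */
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
```

    For U+0043 ('C') both encodings need a single element, while U+1F600 needs two UTF-16 code units (0xD83D, 0xDE00) and four UTF-8 bytes. This is why code that steps through a string one element at a time can split a character in UTF-16 as well as in UTF-8.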

    Possible solution

    If you want to read backwards through a wide string, you might try this:

    int i = (int)wcslen(inStr); // Could be 0.
    if (i > 0) { // Don't read one element past the start of the array.
      do {
        --i; // Step back one wide character, starting from the last one.
      } while ( i > 0 && inStr[i] != L'/' );
    /* When we reach this line, i is either 0 or the index of the last slash
     * in inStr, which could also be 0.  We can test whether inStr[i] == L'/' or
     * write an if() within our loop to do something more complicated.