Tags: c++, unicode, stl

Unclear behaviour of std::next_permutation with std::wstring


Given this program to play around with Unicode, I found results that don't make sense to me:

#include <iostream>
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>

int main() {
    std::locale::global(std::locale(""));
    std::wcout.imbue(std::locale());
    std::wstring unicodeString = L"⊓⊔⊏";           // FAIL!

// 6 permutations as expected
//  std::wstring unicodeString = L"abc";           // OK

    size_t permutations_count = 0;
    // Print the Unicode string
    do {
        std::wcout << unicodeString << std::endl;
        permutations_count++;
    } while (std::next_permutation(unicodeString.begin(),unicodeString.end()));

    std::wcout << "sizeof char " << sizeof(char) << std::endl;
    std::wcout << "sizeof wchar_t " << sizeof(wchar_t) << std::endl;
    std::wcout << "string size: " << unicodeString.size() << std::endl;
    std::wcout << "permutations: " << permutations_count << std::endl;
    assert(permutations_count == 6);

    return 0;
}

GCC 11.4.0 and Clang 19.0.0git (compiled with g++ -std=c++17 -Wall -Wextra and clang++ -std=c++17 -Wall -Wextra) print:

⊓⊔⊏
⊔⊏⊓
⊔⊓⊏
sizeof char 1
sizeof wchar_t 4
string size: 3
permutations: 3
a.out: /mnt/c/Workspace/test.cc:26: int main(): Assertion `permutations_count == 6' failed.
Aborted

For some reason this does not print all 3! = 6 permutations of "⊓⊔⊏". It works correctly when the string "abc" is used instead, which suggests my code is mishandling the multi-byte values somehow.

MSVC CL 19.40.33811 (cl /EHsc /std:c++17) also prints unexpected results:

... thousands of lines cut ...
""SSâS?ââ
""SSâSâ?â
""SSâSââ?
""SSS?âââ
""SSSâ?ââ
""SSSââ?â
""SSSâââ?
sizeof char 1
sizeof wchar_t 2
string size: 9
permutations: 6,725
Assertion failed: permutations_count == 6, file test.cc, line 26

What naively appears to be a standards-compliant, portable program behaves differently across the three vendors, and none of these behaviours make sense to me. Other notes:

  • My source file is encoded as UTF-8. I tried other ways of writing unicodeString, such as std::string unicodeString = u8"⊓⊔⊏";, but that also gives unexpected results.

  • I assume that wstring is designed to deal with "wide characters" (2 bytes with cl, 4 bytes with g++/clang++) rather than individual bytes, since that appears to be the interface the generic algorithms operate on.

Could someone please explain what is being done incorrectly here?


Solution

  • next_permutation only iterates through all permutations if the input starts sorted, and ⊓⊔⊏ is not sorted. This is why GCC 11.4.0 and Clang 19.0.0 do not emit the results you expected; see the first sketch after the footnote for the fix.

    MSVC CL 19.40.33811 is a more complex case. The only way I can think of for it to end up with 9 characters is that your .cpp file is UTF-8, so the string is stored in the file as the bytes 0xe2 0x8a 0x93 0xe2 0x8a 0x94 0xe2 0x8a 0x8f, but the compiler interpreted the file as some other code page (probably CP-1252), treated each byte as a separate character, and stored them as nine separate wchar_t characters in a UTF-16 wstring. This is a common mistake, called mojibake; the second sketch after the footnote shows where the 9 bytes come from. A 9-character string would produce 362,880 permutations, except that, again, your input is not sorted.

    Tangentially, when the program writes those bytes to the terminal, the terminal interprets them as some other encoding (probably CP-1252) and therefore displays the wrong characters.


    Footnote: next_permutation is well defined for any input; it simply advances to the next arrangement in sorted order, and when the next arrangement would be "less" than the current state it returns false. Ergo: if you want to iterate over all permutations easily, start from a sorted state and loop until it returns false. But next_permutation is not limited to this case, and it is well defined to use it in other interesting ways.
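
    A minimal sketch of the fix, assuming the compiler decodes the wide literal correctly (on MSVC you would additionally pass the /utf-8 switch so the UTF-8 source is read as UTF-8): sort the string before the loop so the do/while starts at the lexicographically smallest arrangement.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <string>

    int main() {
        std::wstring s = L"⊓⊔⊏";

        // Start from the lexicographically smallest arrangement so the loop
        // below visits every permutation before next_permutation wraps
        // around and returns false.
        std::sort(s.begin(), s.end());

        std::size_t permutations_count = 0;
        do {
            ++permutations_count;
        } while (std::next_permutation(s.begin(), s.end()));

        std::wcout << L"permutations: " << permutations_count << std::endl;  // prints 6
        return 0;
    }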
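
    And a small self-contained check of the byte-level view, assuming the file is saved as UTF-8 and compiled as C++17 (where u8 literals still yield plain char, unlike C++20): it prints the nine raw UTF-8 bytes that MSVC ended up treating as nine separate characters.

    #include <cstdio>
    #include <string>

    int main() {
        // In C++17 a u8 literal is an array of char, so it can initialise std::string.
        std::string utf8 = u8"⊓⊔⊏";

        std::printf("bytes: %zu\n", utf8.size());   // 9: three bytes per code point
        for (unsigned char byte : utf8)
            std::printf("%02x ", byte);             // e2 8a 93 e2 8a 94 e2 8a 8f
        std::printf("\n");
        return 0;
    }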