c++ encoding byte-order-mark wstring

Why is a string sometimes written in one direction and sometimes in the other?


This is the code:

byte bytes[] = {0x2e, 0x20, 0x65, 0x00, 0x74, 0x00, 0x61, 0x00, 0x64, 0x00, 0x70, 0x00, 0x75, 0x00, 0x67, 0x00};
std::wstring s;
s.resize( 8 );
memcpy( &s[0], bytes, 16 );

_tprintf( _T("key: %s\n"), s.c_str());
MessageBox ( 0, s.c_str(), _T(""), 0 );

The result in the message box is gupdate, and in the console it is ?etadpug.

I think it has to do with encoding. Does 0x2e20 or 0x202e mean something?


Solution

  • Your bytes are a sequence of characters in little-endian UTF-16 (two bytes per code unit).

    It contains the string gupdate stored in reverse order (etadpug), preceded by an RTL override mark, which reverses the display order of the characters after it.

    Specifically:

    0x2e, 0x20  = U+202E = Right-To-Left override
    0x65, 0x00  = U+0065 = e
    0x74, 0x00  = U+0074 = t
    0x61, 0x00  = U+0061 = a
    etc.
    

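    Note how the two bytes of each code unit are swapped: the low byte is stored first, so the bytes 0x2e, 0x20 form the value 0x202E.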

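    So the message box reverses the displayed order of the characters, because it is Unicode-aware and honors the RTL override mark. The regular console output does not. (The console can be Unicode-aware too, but that depends on your project settings and the I/O functions you use; in your case it is evidently the non-aware version.) It prints the characters in stored order, and the leading ? is presumably the override character, which it cannot display.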