c++ string mfc com widestring

When should we prefer wide-character strings?


I am modernizing a large, legacy MFC codebase which contains a veritable medley of string types:

  • CString
  • std::string
  • std::wstring
  • char*
  • wchar_t*
  • _bstr_t

I'd like to standardize on a single string type internally, and convert to other types only when absolutely required by a third-party API (e.g. COM or MFC functions). The question my coworkers and I are debating: which string type should we standardize on?
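To make the "convert only at the boundary" idea concrete, here is a minimal sketch of a conversion helper. It assumes std::string is the internal type and holds UTF-8. The name to_wide is hypothetical. On Windows it relies on MultiByteToWideChar; the non-Windows branch is an ASCII-only fallback included just so the sketch compiles elsewhere.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical boundary helper: converts an internal UTF-8 std::string
// to a std::wstring right before calling a wide-character API.
std::wstring to_wide(const std::string& utf8) {
    if (utf8.empty()) {
        return std::wstring();
    }
#ifdef _WIN32
    const int srcLen = static_cast<int>(utf8.size());
    // First call: ask how many wide characters the result needs.
    const int dstLen = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, nullptr, 0);
    if (dstLen <= 0) {
        return std::wstring();  // conversion failed; real code may want to report this
    }
    std::wstring out(static_cast<std::size_t>(dstLen), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), srcLen, &out[0], dstLen);
    return out;
#else
    // Assumption: input is plain ASCII. This branch does NOT decode UTF-8.
    return std::wstring(utf8.begin(), utf8.end());
#endif
}
```

With a helper like this, a call site would pass something like to_wide(name).c_str() to a wide-character API and keep std::string everywhere else. The reverse direction (WideCharToMultiByte on Windows) would follow the same two-call pattern.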

I would prefer one of the C++ standard strings: std::string or std::wstring. I'm personally leaning toward std::string, because we do not have any need for wide characters - it is an internal codebase with no customer-facing UI (i.e. no need for multiple-language support). "Plain" strings allow us to use simple, unadorned string literals ("Hello world" vs L"Hello world" or _T("Hello world")).
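For reference, the difference in literal syntax mentioned above looks roughly like this. The function names are made up for illustration, and _T is left out because it needs the Windows-only <tchar.h> header.

```cpp
#include <string>

// Narrow literal: no prefix, element type is char.
std::string narrow_greeting() {
    return "Hello world";
}

// Wide literal: L prefix, element type is wchar_t.
std::wstring wide_greeting() {
    return L"Hello world";
}
```

Both strings hold the same eleven characters; only the element type and the storage size per character differ.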

Is there an official stance from the programming community? When faced with multiple string types, what is typically used as the standard 'internal' storage format?


Solution

  • If we are talking about Windows, then I'd use std::wstring (because we often need the string manipulation features it provides), or wchar_t* if you just pass strings around.

    Note that Microsoft recommends this in its documentation, Working with Strings:

    Windows natively supports Unicode strings for UI elements, file names, and so forth. Unicode is the preferred character encoding, because it supports all character sets and languages. Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value. UTF-16 characters are called wide characters, to distinguish them from 8-bit ANSI characters. The Visual C++ compiler supports the built-in data type wchar_t for wide characters.

    Also:

    When Microsoft introduced Unicode support to Windows, it eased the transition by providing two parallel sets of APIs, one for ANSI strings and the other for Unicode strings. [...] Internally, the ANSI version translates the string to Unicode.

    Also:

    New applications should always call the Unicode versions. Many world languages require Unicode. If you use ANSI strings, it will be impossible to localize your application. The ANSI versions are also less efficient, because the operating system must convert the ANSI strings to Unicode at run time. [...] Most newer APIs in Windows have just a Unicode version, with no corresponding ANSI version.
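One caveat on the first quoted passage: "each character is encoded as a 16-bit value" is a simplification. In UTF-16, code points above U+FFFF take two 16-bit code units (a surrogate pair), so the length of a wide string is not always its character count. The sketch below uses char16_t and std::u16string rather than wchar_t so that it behaves the same on every platform. The helper name count_code_points is hypothetical, and it assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-16 string.
// Assumption: the input is well-formed (no unpaired surrogates).
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        // Low (trailing) surrogates 0xDC00-0xDFFF are the second half of a
        // pair, so they do not start a new code point.
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

For example, the string u"A\U0001F600" has a size() of 3 code units but contains 2 code points. This matters mainly if you index or truncate wide strings by position.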