c++ unicode utf

Is handling Unicode characters with wchar_t good? Does it cause any problems?


I have been searching for a way to handle Polish words. I read about UTF-8, UTF-16 and UTF-32, but any conversion from char to a UTF encoding gives me a different letter.

wchar_t gives a correct letter, though.

Is it ok to do it that way?

What about performance if, for example, I only use ASCII anyway? Does it affect the application in any way?


Solution

  • You are confusing two different things:

    1. Storage

      How you store the bytes that make up your text string. Will that be in an array of char (single-byte) values? Or will it be in the form of wchar_t (wide, typically 2 or 4 bytes each) values?

    2. Encoding

      Your computer (and you!) needs to know what to do with the values in those bytes. What do they mean? Regardless of storage, they could be ASCII, some code page, UTF-8, UTF-16, UTF-32, Klingon, anything.
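
    To make the distinction concrete, here is a minimal sketch that stores the same letter, ł (U+0142), in three encodings using three storage types. The variable names and the byte_size helper are made up for illustration, and escapes are used so that the result does not depend on how the source file itself is encoded.

```cpp
#include <cstddef>
#include <string>

// The same letter, U+0142 (ł), in three encodings and three storage types.
const std::string    utf8  = "\xC5\x82";  // UTF-8: two 8-bit code units
const std::u16string utf16 = u"\u0142";   // UTF-16: one 16-bit code unit
const std::u32string utf32 = U"\u0142";   // UTF-32: one 32-bit code unit

// Hypothetical helper: number of bytes a string occupies,
// whatever its code unit type happens to be.
template <typename String>
std::size_t byte_size(const String& s) {
    return s.size() * sizeof(typename String::value_type);
}
```

    The letter is the same in all three cases; only the number and width of the stored values differ. Reading the two UTF-8 bytes as if they were, say, a single-byte code page is exactly the kind of mismatch that produces a "different letter".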

    Usually, for historical reasons, we pick char for single-byte encodings (e.g. ASCII) and UTF-8, and wchar_t for UTF-16 (particularly on Windows, which has 16-bit wchar_ts and generally assumes this combination throughout its API — note that it inaccurately calls this simply "Unicode").
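
    As a rough illustration of why wchar_t does not pin down an encoding by itself, the sketch below checks its width at compile time. The names wchar_bits, likely_wide_encoding and turtle_wide are invented for this example, and the mapping from width to encoding is only a common convention, not something the language guarantees.

```cpp
#include <climits>
#include <string>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr int wchar_bits = static_cast<int>(sizeof(wchar_t) * CHAR_BIT);

// Heuristic only: a 16-bit wchar_t usually means UTF-16 (Windows),
// a 32-bit one usually means UTF-32 (most Unix-like systems).
constexpr const char* likely_wide_encoding() {
    return wchar_bits == 16 ? "UTF-16"
         : wchar_bits == 32 ? "UTF-32"
         : "unknown";
}

// "żółw" as a wide string. Every letter is in the Basic Multilingual
// Plane, so it takes four wchar_t values with either width.
const std::wstring turtle_wide = L"\u017C\u00F3\u0142w";
```

    Polish letters all fit in a single 16-bit unit, which is probably why wchar_t appeared to "just work"; characters outside the Basic Multilingual Plane would need two units on a platform with a 16-bit wchar_t.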

    Performance doesn't really come into it, though you'll save time and energy converting between different encodings if you pick one and stick to it (and use a storage mechanism that fits the string libraries you're using). Sometimes your OS will help determine that choice, but we can't tell you what it will be.

    Similarly, your statements about what "works" and "doesn't work" are very vague, and likely false.

    We can't say what's "ok" without knowing your project's requirements, the kind of computer it will run on, and the technologies involved. I will, though, make a tremendous generalisation. In the olden days, you might have used the Mazovia encoding, an altered code page that included Polish characters. Nowadays, you probably want to make portability and interchange as easy as possible (because why not?!), so you'd be encouraged to stick with UTF-16 stored in wchar_t on Windows, and UTF-8 stored in char otherwise.
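
    If you do go with UTF-8 in char, keep in mind that std::string counts bytes, not letters. The following is a small sketch of a code point counter; count_code_points is a hypothetical helper, and it assumes the input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII: starts a new code point
        }
    }
    return count;
}
```

    For "żółw" written as the UTF-8 bytes "\xC5\xBC\xC3\xB3\xC5\x82w", size() reports 7 while this function reports 4. For pure ASCII text the two numbers are equal, which is why ASCII-only programs never notice the difference.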

    (C++20 also adds char8_t, a storage type specifically designed to signify that it holds UTF-8-encoded data; however, it may be some time before you see it in widespread use, if at all. You can read more about C++'s character types in cppreference.com's article on "Fundamental types".)
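
    For the curious, this is roughly what the difference looks like in practice. The sketch uses the __cpp_char8_t feature-test macro so that it compiles both with and without C++20 support; turtle_u8 is just an illustrative name.

```cpp
#include <string>

#ifdef __cpp_char8_t
// C++20: u8 literals are arrays of char8_t, so the type itself says "UTF-8".
const std::u8string turtle_u8 = u8"\u017C\u00F3\u0142w";
#else
// C++11 to C++17: u8 literals are plain char, so nothing in the type
// distinguishes UTF-8 text from text in any other encoding.
const std::string turtle_u8 = u8"\u017C\u00F3\u0142w";
#endif
```

    Either way the string holds the same seven UTF-8 bytes for "żółw"; only the element type changes.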