I have a problem. I'm writing an app in Polish (with, of course, Polish characters) for Linux and I get 80 warnings when compiling. They are all either "warning: multi-character character constant" or "warning: case label value exceeds maximum value for type". I'm using std::string.
How do I replace the std::string class?
Please help. Thanks in advance. Regards.
std::string does not define a particular encoding. You can thus store any sequence of bytes in it. There are subtleties to be aware of:
1. .c_str() will return a null-terminated buffer. If your character set allows null bytes, don't pass this string to functions that take a const char* parameter without a length, or your data will be truncated.
2. char does not represent a character, but a **byte**. IMHO, this is the most problematic nomenclature in computing history. Note that wchar_t does not necessarily hold a full character either: where it is 16 bits wide, UTF-16 encodes some characters as a surrogate pair of two units.
3. .size() and .length() will return the number of bytes, not the number of characters.
[edit] The warnings about case labels are related to issue (2). You are using a switch statement with multi-byte characters using type char, which cannot hold more than one byte.[/edit]
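One way to get rid of those case-label warnings, as a rough sketch: decode the UTF-8 bytes into code points and switch on char32_t values instead of char. The helper names decode_utf8, is_polish_diacritic and count_polish_diacritics are made up for this illustration, and the decoder assumes mostly well-formed UTF-8 (it does only minimal validation).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes one UTF-8 sequence starting at s[i] and
// advances i past it. Minimal validation only: a bad or truncated
// sequence yields U+FFFD (the replacement character).
char32_t decode_utf8(const std::string& s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                      // single byte (ASCII)
    int extra = 0;
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return 0xFFFD;                                // stray continuation byte
    for (; extra > 0; --extra) {
        if (i >= s.size()) return 0xFFFD;              // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD;         // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}

// The case labels are char32_t code points, so each label fits its type
// and the compiler has nothing to warn about.
bool is_polish_diacritic(char32_t cp) {
    switch (cp) {
        case U'\u0105':  // ą
        case U'\u0107':  // ć
        case U'\u0119':  // ę
        case U'\u0142':  // ł
        case U'\u0144':  // ń
        case U'\u00F3':  // ó
        case U'\u015B':  // ś
        case U'\u017A':  // ź
        case U'\u017C':  // ż
            return true;
        default:
            return false;
    }
}

// Walks a UTF-8 string one code point at a time.
std::size_t count_polish_diacritics(const std::string& utf8) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < utf8.size(); ) {
        if (is_polish_diacritic(decode_utf8(utf8, i))) ++n;
    }
    return n;
}
```

The literals are written as \u escapes so the example does not depend on the encoding of the source file. Only the lowercase letters are listed; uppercase ones would need their own labels.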
Therefore, you can use std::string in your application, provided that you respect these three rules. There are subtleties involving the STL, including std::find(), that are consequences of this. You need to use some more clever string matching algorithms to properly support Unicode because of normalization forms.
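To make the three rules and the std::find() caveat concrete, here is a small sketch assuming the strings hold well-formed UTF-8. The helper names (count_code_points, c_api_length, contains_bytes, contains_byte) are invented for the example, and the test word is spelled with byte escapes so it does not depend on the source-file encoding.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Counts code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Assumes well-formed UTF-8. Note: code points, not user-visible characters.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Rule 1: the length a C API sees when it is handed only c_str().
std::size_t c_api_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// Byte-wise substring search: knows nothing about normalization forms.
bool contains_bytes(const std::string& text, const std::string& needle) {
    return text.find(needle) != std::string::npos;
}

// std::find() over char compares single bytes, not characters.
bool contains_byte(const std::string& text, char byte) {
    return std::find(text.begin(), text.end(), byte) != text.end();
}
```

With "mój" written using the precomposed ó ("m\xC3\xB3j"), size() is 4 while count_code_points() gives 3. The same word with a decomposed ó ("mo\xCC\x81j", that is "o" followed by the combining acute accent U+0301) has size() 5, and contains_bytes() no longer finds "\xC3\xB3" in it, while contains_byte() finds a plain 'o' that is really only part of a character. Similarly, c_api_length(std::string("ab\0cd", 5)) is 2 even though size() is 5.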
However, when writing applications in any language that uses non-ASCII characters (if you're paranoid, consider this anything outside [0, 128)), you need to be aware of encodings in different sources of textual data.
These two issues are not addressed by any particular string class. You just need to convert every external source to your internal encoding. I suggest UTF-8 all the time, but especially so on Linux because of native support. I strongly recommend placing your string literals in a message file to forget about issue (1) and only deal with issue (2).
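As an illustration of converting an external source to UTF-8 at the boundary, here is a minimal sketch using POSIX iconv. The function name to_utf8 is hypothetical, the iconv() signature with char** is the one used by glibc (some platforms declare it differently), and the set of supported encoding names depends on the system.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `input` from `from_encoding` to UTF-8.
// Sketch only: assumes a stateless source encoding such as ISO-8859-2.
std::string to_utf8(const std::string& input, const char* from_encoding) {
    iconv_t cd = iconv_open("UTF-8", from_encoding);   // arguments: (to, from)
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("encoding not supported by iconv");
    }
    std::string output;
    char buffer[256];
    // iconv() does not modify the input bytes; the cast only satisfies
    // the char** parameter type.
    char* in = const_cast<char*>(input.data());
    std::size_t in_left = input.size();
    while (in_left > 0) {
        char* out = buffer;
        std::size_t out_left = sizeof buffer;
        const std::size_t rc = iconv(cd, &in, &in_left, &out, &out_left);
        const int err = errno;                         // save before other calls
        output.append(buffer, sizeof buffer - out_left);
        // E2BIG only means the output buffer filled up: loop again.
        if (rc == static_cast<std::size_t>(-1) && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return output;
}
```

For example, the ISO-8859-2 bytes "\xBF\xF3\xB3w" (the word "żółw") should come back as the UTF-8 bytes "\xC5\xBC\xC3\xB3\xC5\x82w", after which the rest of the program only ever deals with one encoding.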
I don't suggest using std::wstring on Linux because 100% of native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to/from std::wstring non-stop and eventually get something wrong, on top of making everything slow(er).
If you were writing an application for Windows, I'd suggest exactly the opposite because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to/from const wchar_t*.
Some "portable" libraries/languages use different representations based on the platform. They use UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading about that trick in the Python reference implementation, but the article was quite old. I'm not sure if that is true anymore.