Search code examples
c++linuxwarningsstdstringpolish

Polish chars in std::string


I have a problem. I'm writing an app in Polish (with, of course, polish chars) for Linux and I receive 80 warnings when compiling. These are just "warning: multi-character character constant" and "warning: case label value exceeds maximum value for type". I'm using std::string.

How do I replace std::string class?

Please help. Thanks in advance. Regards.


Solution

  • std::stringdoes not define a particular encoding. You can thus store any sequence of bytes in it. There are subtleties to be aware of:

    1. .c_str() will return a null-terminated buffer. If your character set allows null bytes, don't pass this string to functions that take a const char* parameter without a lenght, or your data will be truncated.
    2. A char does not represent a character, but a **byte. IMHO, this is the most problematic nomenclature in computing history. Note that wchar_t does necessarily hold a full character either, depending on UTF-16 normalization.
    3. .size() and .length() will return the number of bytes, not the number of characters.

    [edit] The warnings about case labels is related to issue (2). You are using a switch statement with multi-byte characters using type char which can not hold more than one byte.[/edit]

    Therefore, you can use std::string in your application, provided that you respect these three rules. There are subtleties involving the STL, including std::find() that are consequences of this. You need to use some more clever string matching algorithms to properly support Unicode because of normalization forms.

    However, when writing applications in any language that uses non-ASCII characters (if you're paranoid, consider this anything outside [0, 128)), you need to be aware of encodings in different sources of textual data.

    1. The source-file encoding might not be specified, and might be subject to change using compiler options. Any string literal will be subject to this rule. I guess this is why you are getting warnings.
    2. You will get a variety of character encodings from external sources (files, user input, etc.). When that source specifies the encoding or you can get it from some external source (i.e. asking the user that imports the data), then this is easier. A lot of (newer) internet protocols impose ASCII or UTF-8 unless otherwise specified.

    These two issues are not addressed by any particular string class. You just need to convert all any external source to your internal encoding. I suggest UTF-8 all the time, but especially so on Linux because of native support. I strongly recommend to place your string literals in a message file to forget about issue (1) and only deal with issue (2).

    I don't suggest using std::wstring on Linux because 100% of native APIs use function signatures with const char* and have direct support for UTF-8. If you use any string class based on wchar_t, you will need to convert to/from std::wstring non-stop and eventually get something wrong, on top of making everything slow(er).

    If you were writing an application for Windows, I'd suggest exactly the opposite because all native APIs use const wchar_t* signatures. The ANSI versions of such functions perform an internal conversion to/from const wchar_t*.

    Some "portable" libraries/languages use different representations based on the platform. They use UTF-8 with char on Linux and UTF-16 with wchar_t on Windows. I recall reading bout that trick in the Python reference implementation but the article was quite old. I'm not sure if that is true anymore.