c++ c++11 xcode6 osx-mavericks string-literals

How are string literals stored in memory for c++?


I have a question about how string literals are stored in memory in C++. I know that a char is stored according to its ASCII code, but I am rather after the Unicode character set. The reason is that I am trying to deal with some locales. Let us assume that what I am trying to do is convert lower case characters to upper case. This works in the Xcode terminal,

#include <iostream>
#include <locale>   // std::locale, std::ctype, std::use_facet
#include <string>

using namespace std;

int main()
{
    wcout.imbue(std::locale("sv_SE.Utf-8"));
    const std::ctype<wchar_t>& f = std::use_facet< std::ctype<wchar_t> >(std::locale("sv_SE.Utf-8"));

    wstring str {L"åäö"}; // Swedish letters

    // Convert the whole string to upper case in place
    f.toupper(&str[0], &str[0] + str.size());

    std::wcout << str.length() << std::endl;
    std::wcout << str << std::endl;
}

Output:
3
ÅÄÖ

However, when I try to run it in OS X terminal I get rubbish,

Output:
3
ÅÄÖ

Further when I prompt the user for input instead,

#include <iostream>
#include <locale>   // std::locale, std::ctype, std::use_facet
#include <string>

using namespace std;

int main()
{
    wcin.imbue(std::locale(""));
    wcout.imbue(std::locale("sv_SE.Utf-8"));
    const std::ctype<wchar_t>& f = std::use_facet< std::ctype<wchar_t> >(std::locale("sv_SE.Utf-8"));

    //wstring str {L"åäö"};
    wcout << "Write something>> ";
    wstring str;
    getline(wcin, str);

    // Convert the whole string to upper case in place
    f.toupper(&str[0], &str[0] + str.size());

    std::wcout << str.length() << std::endl;
    std::wcout << str << std::endl;
}

I get rubbish from Xcode terminal,

Output:
Write something>> åäö
6
åäö

And the OS X terminal actually hangs when I use these letters. It is possible to make the wcin stream assume the C encoding with wcin.imbue(std::locale());, which still gives the same output in Xcode, but gives the following in the OS X terminal:

Output:
Write something>> åäö
3
ŒŠš

So the problem is quite clearly related to encodings. What I wonder is how string literals are actually stored in memory in C++. This can be split into two different cases.

Case 1: A string literal typed in source code, e.g. wstring str {L"åäö"};.

Case 2: A string entered via standard input stream (wcin in this case).

These two cases do not necessarily store the strings in the same way. I know that Unicode is a character set and that UTF-8 is an encoding, so what I wonder is rather whether string literals are encoded when stored in memory, and if so, how.

Further, if anyone knows how to identify the encoding used by the current terminal automatically, that would be great.

BR Patrik

EDIT

I got some comments which, even though some of them are good, are not exactly related to the question. This means that the question probably needs some clarification. The question can be seen as a generalization of the fairly ill-formulated question:

"Can I assume that string literals are stored as their Unicode code points in memory?"

This question is badly formulated for at least two reasons. First, it makes an assumption about how string literals are stored (as their Unicode code points). This means that the answer must relate to Unicode, even though this relation may be completely pointless. Further, it is a yes-or-no question, which gives no help in case the answer is no.

I also understand that this can be tested by converting each character to its integer equivalent and printing it, but this would require testing against the entire Unicode character set (which seems an unreasonable way of doing this).
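
A minimal sketch of such a check for a single string, not a definitive test. The helper name dump_code_units is hypothetical, introduced here only for illustration. It formats each wchar_t of a wide string as a hexadecimal number, and assumes that wchar_t values fit in 32 bits.

```cpp
#include <cstddef>
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats every element of a wide string as a
// zero-padded hexadecimal number, separated by spaces, so the values held
// in memory can be compared with the expected Unicode code points.
std::string dump_code_units(const std::wstring& s)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) {
            out << ' ';
        }
        // setw applies only to the next insertion, so it is set each time.
        out << "0x" << std::setw(4) << static_cast<std::uint32_t>(s[i]);
    }
    return out.str();
}
```

Passing the literal L"åäö" should give the three values 0x00E5 0x00E4 0x00F6 if the compiler decoded the source file as UTF-8 and uses Unicode values for wide characters. Six values starting with 0x00C3 0x00A5 would instead suggest that each UTF-8 byte was widened on its own, which would also be consistent with the length of 6 reported above for the wcin case.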


Solution

  • First, the way the source file is interpreted as a sequence of characters is implementation-defined. You have to consult your compiler's documentation to determine this.

    Second, the character set used for string literals in the compiled program is also implementation-defined, so again you have to consult your compiler's documentation.

    When you put non-ASCII characters in the source (and possibly even ASCII ones), different compilers are likely to interpret them differently. You have to check that the compilers you target can all handle the same source encoding; the encoding most likely to work portably is UTF-8.

    In addition, you may be better off using UTF-8-encoded text for most of the program; only code near an API that requires wchar_t would need to handle wide strings.

    Bottom line: make sure your compiler stores string literals verbatim, use ordinary (narrow) strings, and use an editor that saves in UTF-8 encoding.
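
As a rough illustration of the narrow-string approach, assuming the text is valid UTF-8: the helper utf8_code_points below is hypothetical (it is not from any library) and counts characters in a std::string by skipping continuation bytes, so no conversion to wchar_t is needed just to get a character count.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 encoded
// narrow string. Continuation bytes have the bit pattern 10xxxxxx and are
// skipped; every other byte starts a new code point.
// Assumption: the input is valid UTF-8 (no validation is performed).
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the bytes C3 A5 C3 A4 C3 B6 (the UTF-8 encoding of åäö), size() reports 6 while utf8_code_points reports 3.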