Tags: c++, locale, emoji, getline, wstring

getline() reaches end of file when reading UTF-8 emoji character


I'm writing a C++ program that processes large delimited files.

I have a UTF-8 csv file that contains a row with the (emoji?) character 🌟. It looks something like this:

123,"james","piotrj🌟","1996-01-28"

When I call getline() on this row, it reads up to the emoji and then stops. So the resulting string from getline() is 123,"james","piotrj. I'm not sure exactly why it is happening. If I had to guess, I'm using locale improperly and this emoji (or part of it) is being read as an EOF.

I would like to read this row in as is, do some string operations, and then write it out to another file.

I have some example code here:

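#include <fstream>
#include <iostream>
#include <locale>
#include <string>
using namespace std;

int main() {
  locale loc("en_US.UTF8");  // throws runtime_error if the locale is not installed
  wifstream inFile;
  inFile.imbue(loc);         // imbue before open()
  inFile.open("MyFile.csv");
  if (inFile.is_open()) {
    wstring str;
    if (getline(inFile, str)) {
      wcout << str << endl;
    }
    if (getline(inFile, str)) {
      wcout << str << endl;
    }
    inFile.close();
  }
}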

The output of this code is: 123,"james","piotrj. The second if statement's body does not execute because the second getline() call reads nothing.

To try some things, I changed the locale to this:

locale loc = locale();

The name of this locale is "C", and with it getline() reads the entire line. The output of the program is: 123,"james","piotrj🌟","1996-01-28". This is a step in the right direction, but without the proper locale the emoji is not stored in the wstring as a single character. My program checks individual characters to see whether the string could be represented in ANSI, so I would really like the wstring to hold that emoji as one character.


Solution

  • It looks like you are using libc++. Wide streams in this implementation do not support UTF-8 at all.

    Should you use libstdc++ instead, your program would work, except you would get transliterated text on the output. I am getting

    123,"james","piotrj?","1996-01-28"
    

    That's because the locale is not imbued in wcout. To get normal text, you would need to do either

    ios_base::sync_with_stdio(false);
    wcout.imbue(loc);
    

    (you cannot imbue a locale in a standard stream if it is synched with stdio)

    or, alternatively,

    locale::global(loc);
    

    Then your program would fully work.

    If you are tied to libc++, your only alternative is to use narrow character streams.

    Edit: with MSVC this code doesn't work either. I don't know why Microsoft claims UTF-8 support in newer versions of Windows; apparently it's not there at all. On Windows one can install gcc (there are several flavours; I recommend the UCRT flavour available with MSYS2). I cannot guarantee it will work, though, because control ultimately passes through the Microsoft runtime libraries.

    The proper solution is to never, ever use any wchar_t APIs except when calling specific WinAPI functions that require wchar_t. Use narrow characters: read UTF-8 from your file, store and manipulate strings as UTF-8, and output them as UTF-8. I have tested this code converted to narrow characters with MSVC, and it works as expected for me.
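
The narrow-character approach recommended above can be sketched roughly as follows. This is a minimal illustration, not the answerer's tested code: the helper names read_first_line, decode_utf8 and is_ascii_only are hypothetical, and the "could this be represented in ANSI" check from the question is assumed here to mean "every code point is 7-bit ASCII". The line is read as raw UTF-8 bytes with a plain ifstream and no locale, then decoded into code points only where per-character inspection is needed, so the emoji shows up as one element.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Read the first line of a file as raw bytes (no locale, no wide streams).
// Returns false if the file cannot be opened or is empty.
bool read_first_line(const std::string& path, std::string& line) {
    std::ifstream in(path);
    return static_cast<bool>(std::getline(in, line));
}

// Decode UTF-8 bytes into Unicode code points.
// Assumption of this sketch: malformed or truncated sequences become U+FFFD;
// overlong encodings and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }
            unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t{0xFFFD});
    }
    return out;
}

// True if every code point is 7-bit ASCII (this sketch's stand-in for "ANSI").
bool is_ascii_only(const std::u32string& code_points) {
    for (char32_t cp : code_points) {
        if (cp > 0x7F) return false;
    }
    return true;
}
```

A caller would pass "MyFile.csv" to read_first_line, run decode_utf8 on the resulting std::string for the character checks, and write the original std::string unchanged to the output file with an ofstream. Since the bytes are never converted on the way in or out, the emoji should survive the round trip regardless of the standard library in use.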