Get Unicode from json file


I am working on a C++ project (C++14) and I am facing the following issue. I have a JSON file with a field containing Unicode characters. I am using nlohmann/json to retrieve information from this field. However, when I retrieve it, it doesn't match my expectations. Below is my source code:

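#include <fstream>
#include <iostream>
#include <string>
#include <nlohmann/json.hpp>

int main() {
    // Expected value: "鶴屋八幡 京阪百貨店守口店"
    std::u16string Str16 = u"\u9db4\u5c4b\u516b\u5e61 \u4eac\u962a\u767e\u8ca8\u5e97\u5b88\u53e3\u5e97";
    std::ifstream jsonFile("name.json");

    if (!jsonFile.is_open()) {
        std::cerr << "can't open JSON" << std::endl;
        return 1;
    }

    // Parse the whole file into a JSON value
    nlohmann::json jsonData;
    jsonFile >> jsonData;

    // Read the "name" field as a std::string
    std::string unicodeStr = jsonData["name"].get<std::string>();

    jsonFile.close();
    return 0;
}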

name.json

{
"name": "\u9db4\u5c4b\u516b\u5e61 \u4eac\u962a\u767e\u8ca8\u5e97\u5b88\u53e3\u5e97"
}

I want to convert the string 'unicodeStr' into a 'char16_t' string ('std::u16string'), or at least get unicodeStr = "\u9db4\u5c4b\u516b\u5e61 \u4eac\u962a\u767e\u8ca8\u5e97\u5b88\u53e3\u5e97".

Can someone help me with this issue?


Solution

  • Your code is working fine. The problem is elsewhere.

    All things in memory are just byte values. The idea that certain patterns represent integers, floats, characters, or strings requires you to make assumptions about how you interact with that memory. Notably, there are hundreds of different encodings for storing text in bytes (9+ of these encodings have full Unicode support), and std::string can hold basically all but 3 of those.

    Most developers just "use the defaults", which on American versions of Windows means assuming that all text is stored in the Windows-1252 encoding. Developers also assume that each "character" is 1 char, which is true for Windows-1252 but incorrect for many other encodings. Notably, this assumption is incorrect for all 9+ Unicode encodings.
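
To make the "1 character is not 1 char" point concrete, here is a minimal sketch that compares the byte length of a UTF-8 std::string with the number of code points it holds. The helper name count_code_points is made up for this illustration, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration: counts Unicode code points in a
// UTF-8 string by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; it performs no validation.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return n;
}
```

For the first two characters of the question's string (U+9DB4 and U+5C4B), size() reports 6 bytes while count_code_points reports 2.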

    However, nlohmann is giving you a std::string whose contents are encoded in UTF-8. Therefore, when your other code uses this text (such as passing it to std::cout), that code decodes the bytes as some other encoding (probably Windows-1252), which results in 鶴屋八幡 京阪百貨店守å£åº—. This is a very common bug, commonly called Mojibake. (Note that the text in the sample image even looks virtually identical to your results.)
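
As a rough illustration of how this garbling arises, the sketch below reinterprets every byte of a UTF-8 string as a single Latin-1 character and writes the result back out as UTF-8. The helper misread_as_latin1 is hypothetical, and Latin-1 is used here only as a simple stand-in for Windows-1252 (the two differ for bytes 0x80 to 0x9F).

```cpp
#include <string>

// Hypothetical helper for illustration: pretends each byte is one Latin-1
// character (code point == byte value) and re-encodes the result as UTF-8.
// This mimics what a program does when it decodes UTF-8 with the wrong codec.
std::string misread_as_latin1(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));  // ASCII is unchanged
        } else {
            // Code points 0x80..0xFF need two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

Feeding it the three UTF-8 bytes of U+9DB4 (E9 B6 B4) yields the three characters é, ¶ and ´, which is how the garbled text above begins.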

    There are generally two ways to approach this problem:

    • Looks easy, actually hard: Use some library to convert the text from UTF-8 to the encoding that the rest of your program is using. Some libraries for this are ICU, Boost, or the Windows APIs. The rest of your code still won't handle Unicode text correctly, though.
    • Looks hard, actually easy: Fix your entire program to interpret std::string as UTF-8. Windows has helper functions for this (for example, SetConsoleOutputCP(CP_UTF8) for console output), while Linux usually does this by default.
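
If a std::u16string is still needed, as the question asks, the UTF-8 bytes can be decoded into UTF-16 code units by hand. The following is a minimal sketch, not a production decoder: utf8_to_utf16 is a hypothetical helper (it is not part of nlohmann/json), and it assumes reasonably well-formed input. It rejects bad lead and continuation bytes but does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration: decodes UTF-8 bytes into UTF-16
// code units. Throws std::invalid_argument on malformed sequences.
// Limitation: overlong encodings and encoded surrogates are not rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;

        // The lead byte tells us how long the sequence is.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With the code from the question, this could be called as utf8_to_utf16(unicodeStr), and the result should then compare equal to Str16. A maintained library such as ICU is the safer choice for real input.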