Reading and converting Cyrillic Unicode characters from a JSON file in C++

I have a JSON file with the following content (for example):

{
    "excel_filepath": "excel_file.xlsx",
    "line_length": 5.0,
    "record_frequency": 2.5,
    "report_file_name": "\u041f\u0421 \u041f\u0440\u043e\u043c\u0437\u043e\u043d\u0430 - \u041f\u0421 \u041f\u043e\u0433\u043e\u0440\u0435\u043b\u043e\u0432\u043e (\u0426.1)",
    "line_type": 1
}

This JSON file is generated by a Python script.

For reading the JSON file, I use the <nlohmann/json.hpp> library (I found it simple for my case):

using json = nlohmann::json;

std::ifstream f("temp_data.json");
json data = json::parse(f);

What I want to do is read the "report_file_name" value and create a simple .txt file named after it. As you can see, the value is stored as Unicode escape sequences.

What I am trying to do is as follows:

_setmode(_fileno(stdout), _O_U16TEXT);
const locale utf8_locale = locale(locale(), new codecvt_utf8<wchar_t>());

string report_file_name = data["report_file_name"];
    
for (auto unicode_char : report_file_name) 
{
    wcout << typeid(unicode_char).name() << ": " << unicode_char << endl;
}

wofstream report_file(report_file_name + L".txt");
report_file.imbue(utf8_locale);

This gives an output as:

char: Ð  
char:  
char: Ð  
char: ¡  
char:  
char: Ð  
char:  
char: Ñ  
char:  
char: Ð  
char: ¾
... and so on

I have to note that I somehow managed to write Cyrillic letters into a report file. Interestingly, when I do:

wcout << L"\u041f\u0421" << endl;

It prints out Cyrillic letters (ПС) correctly. Also, no problem with creating the report .txt file with a Cyrillic name from code:

wofstream report_file(L"Отчет.txt"); // fine!

Am I doing something wrong? I'm using Windows 10 and Visual Studio 2022 with the C++17 standard, if that helps.

Solution

Per nlohmann::json's documentation:

https://github.com/nlohmann/json#character-encoding

Character encoding

The library supports Unicode input as follows:

Only UTF-8 encoded input is supported which is the default encoding for JSON according to RFC 8259.

std::u16string and std::u32string can be parsed, assuming UTF-16 and UTF-32 encoding, respectively. These encodings are not supported when reading from files or other input containers.

Other encodings such as Latin-1 or ISO 8859-1 are not supported and will yield parse or serialization errors.

Unicode noncharacters will not be replaced by the library.

Invalid surrogates (e.g., incomplete pairs such as \uDEAD) will yield parse errors.

The strings stored in the library are UTF-8 encoded. When using the default string type (std::string), note that its length/size functions return the number of stored bytes rather than the number of characters or glyphs.

When you store strings with different encodings in the library, calling dump() may throw an exception unless json::error_handler_t::replace or json::error_handler_t::ignore are used as error handlers.

To store wide strings (e.g., std::wstring), you need to convert them to a UTF-8 encoded std::string before, see an example.
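The point above about length/size returning bytes is what explains the output in the question: iterating a UTF-8 std::string visits one byte at a time, and each Cyrillic letter occupies two bytes. As a rough illustration (the helper name count_code_points is made up for this example, and it assumes the input is valid UTF-8):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
// Assumes the input is valid UTF-8; it does no validation.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8)
    {
        if ((byte & 0xC0) != 0x80) // lead byte or ASCII, not a continuation
            ++count;
    }
    return count;
}
```

For the two letters "ПС", written as the explicit bytes "\xD0\x9F\xD0\xA1", size() reports 4 while count_code_points() reports 2. The 0xD0 and 0xA1 bytes are the "Ð" and "¡" seen in the question's output.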

So, in your case, report_file_name is a UTF-8 encoded std::string, which is why iterating over it char by char prints individual bytes rather than letters. You need to decode it into a std::wstring (UTF-16 on Windows, UTF-32 on most other platforms) before you can use it with std::wofstream, e.g.:

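// Requires <windows.h>, <stdexcept> and <string>.
// Uses the Win32 MultiByteToWideChar() API. Alternatives: std::wstring_convert with
// std::codecvt_utf8/_utf16 (deprecated since C++17), or a library such as ICU or iconv.
std::wstring utf8_to_wstr(const std::string &utf8)
{
    if (utf8.empty()) return std::wstring();
    const int size = static_cast<int>(utf8.size());
    // The first call only computes the required length (0 means invalid UTF-8).
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8.data(), size, nullptr, 0);
    if (len <= 0) throw std::runtime_error("invalid UTF-8 input");
    std::wstring wstr(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), size, &wstr[0], len);
    return wstr;
}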

...

std::wstring report_file_name = utf8_to_wstr(data["report_file_name"].get<std::string>());
std::wofstream report_file(report_file_name + L".txt");
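
If you would rather not depend on the Win32 API or on the deprecated std::wstring_convert, the decoding can also be written by hand. The following is a minimal sketch, not a hardened implementation: the name utf8_to_wstr_portable is made up for this example, it throws on malformed or truncated sequences, but it does not reject overlong encodings or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical portable helper: decodes UTF-8 into a std::wstring.
// Where wchar_t is 16 bits (e.g. Windows), code points above U+FFFF are
// written as UTF-16 surrogate pairs; elsewhere each code point is stored
// in a single wchar_t.
std::wstring utf8_to_wstr_portable(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF)
        {
            // Split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
        else
        {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

The result can be used in the same way as in the line above. If the code also has to build outside MSVC, constructing a std::filesystem::path from the wide string and passing that to the stream is the more portable C++17 route, since opening a stream directly with a wide file name is not available on every platform.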