
Reading and converting Cyrillic Unicode characters from a JSON file in C++


I have a JSON file with the following content (for example):

{
    "excel_filepath": "excel_file.xlsx",
    "line_length": 5.0,
    "record_frequency": 2.5,
    "report_file_name": "\u041f\u0421 \u041f\u0440\u043e\u043c\u0437\u043e\u043d\u0430 - \u041f\u0421 \u041f\u043e\u0433\u043e\u0440\u0435\u043b\u043e\u0432\u043e (\u0426.1)",
    "line_type": 1
}

This JSON file is generated by a Python script.

For reading the JSON file, I use the <nlohmann/json.hpp> library (I found it simple for my case):

using json = nlohmann::json;

std::ifstream f("temp_data.json");
json data = json::parse(f);

What I want to do is read the "report_file_name" value and create a simple .txt file named after it. As you can see, the value is stored as Unicode escape sequences.

What I am trying to do is as follows:

_setmode(_fileno(stdout), _O_U16TEXT);
const locale utf8_locale = locale(locale(), new codecvt_utf8<wchar_t>());

string report_file_name = data["report_file_name"];
    
for (auto unicode_char : report_file_name) 
{
    wcout << typeid(unicode_char).name() << ": " << unicode_char << endl;
}

wofstream report_file(report_file_name + L".txt");
report_file.imbue(utf8_locale);

This gives an output as:

char: Ð  
char:  
char: Ð  
char: ¡  
char:  
char: Ð  
char:  
char: Ñ  
char:  
char: Ð  
char: ¾
... and so on

I have to note that I somehow managed to write Cyrillic letters into a report file. Interestingly, when I do:

wcout << L"\u041f\u0421" << endl;

It prints out Cyrillic letters (ПС) correctly. Also, no problem with creating the report .txt file with a Cyrillic name from code:

wofstream report_file(L"Отчет.txt"); // fine!

Am I doing something wrong? I'm using Windows 10 and Microsoft Visual Studio 2022 with the C++17 standard, if that is helpful.


Solution

  • Per nlohmann::json's documentation:

    https://github.com/nlohmann/json#character-encoding

    Character encoding

    The library supports Unicode input as follows:

    • Only UTF-8 encoded input is supported which is the default encoding for JSON according to RFC 8259.
    • std::u16string and std::u32string can be parsed, assuming UTF-16 and UTF-32 encoding, respectively. These encodings are not supported when reading from files or other input containers.
    • Other encodings such as Latin-1 or ISO 8859-1 are not supported and will yield parse or serialization errors.
    • Unicode noncharacters will not be replaced by the library.
    • Invalid surrogates (e.g., incomplete pairs such as \uDEAD) will yield parse errors.
    • The strings stored in the library are UTF-8 encoded. When using the default string type (std::string), note that its length/size functions return the number of stored bytes rather than the number of characters or glyphs.
    • When you store strings with different encodings in the library, calling dump() may throw an exception unless json::error_handler_t::replace or json::error_handler_t::ignore are used as error handlers.
    • To store wide strings (e.g., std::wstring), you need to convert them to a UTF-8 encoded std::string before, see an example.

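The point about size() returning bytes explains the output in the question: each Cyrillic letter occupies two bytes in UTF-8, so a char-by-char loop prints the two halves separately (0xD0 shows up as "Ð", 0xA1 as "¡"). Below is a minimal sketch of the difference between bytes and characters. The helper count_code_points and the variable sample are hypothetical names used only for illustration; they are not part of nlohmann::json.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (bytes of the form 10xxxxxx).
// Assumes the input is valid UTF-8; it does no validation.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: start of a new code point
        }
    }
    return count;
}

// "ПС" written as explicit UTF-8 bytes, so the example does not depend on
// the encoding of the source file.
const std::string sample = "\xD0\x9F\xD0\xA1";
// sample.size() counts bytes (4); count_code_points(sample) counts characters (2).
```
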
    So, in your case, report_file_name is a UTF-8 encoded std::string. You need to decode it into a std::wstring (UTF-16 on Windows, typically UTF-32 on other platforms) before you can use it with std::wofstream, e.g.:

    std::wstring utf8_to_wstr(const std::string &utf8)
    {
        // There are many questions on StackOverflow about how to do this conversion.
        // You can use the Win32 MultiByteToWideChar() API, or std::wstring_convert
        // with std::codecvt_utf8/_utf16 (both deprecated since C++17), or a 3rd party
        // Unicode library such as ICU or iconv...
    }
    
    ...
    
    std::wstring report_file_name = utf8_to_wstr(data["report_file_name"].get<std::string>());
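
For illustration, here is one possible body for utf8_to_wstr, written as a hand-rolled decoder rather than any of the library options named above. It is a sketch under stated assumptions: it relies only on the standard library, emits UTF-16 surrogate pairs when wchar_t is 16 bits wide (as on Windows) and one wchar_t per code point otherwise, and rejects malformed or truncated sequences but does not check for overlong encodings. For production code, MultiByteToWideChar() or ICU is likely the safer choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch of a UTF-8 -> std::wstring decoder (not the library's own API).
std::wstring utf8_to_wstr(const std::string& utf8)
{
    std::wstring out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;   // decoded code point
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
            // 16-bit wchar_t: encode as a UTF-16 surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

With MSVC, the resulting std::wstring can then be passed to the std::wofstream constructor (with L".txt" appended), as in the question. Keep the imbue() call with the UTF-8 locale if the file contents should also be written as UTF-8.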