boost unicode utf-8 fstream

Converting "\\u1234" to "\u1234"


I have a string that I retrieved from an HTML page using boost's regex_search(). Unfortunately, the Japanese characters in the page are written as \u escape codes, and regex_search treats these as ordinary characters in the string.

So, my question is, how does one go about converting these codes to normal Unicode text? (UTF-8 obviously)

This is a fundamental issue with fstream having absolutely no regard for UTF-8. It looks like boost has its own implementation of fstream, but changing to it had no effect on my program, and I couldn't find any extra settings to configure boost's fstream to work with UTF-8 (although today is my first day ever working with boost, so I could have missed it).

As a final note: I'm running this on Linux, but I'd certainly appreciate a portable solution over a system-specific one.

Thanks all, I really appreciate the help :D


Solution

  • fstream is a narrow-character-only stream (it's a typedef for basic_fstream<char>). std::wfstream would be the type you're looking for. To be perfectly portable, for example to Windows, you may have to introduce C++11 dependencies or rely on boost.locale: Windows has no Unicode locales but supports the locale-independent Unicode conversions introduced by C++11, whereas GCC on Linux doesn't support the new Unicode conversions (they were only added in GCC 5) but has plenty of Unicode locales to choose from.

    Your steps would be:

    1. parse the string to obtain the hexadecimal values of the code points
    2. store them as wide characters.
    3. write them to a std::wofstream (or convert to UTF-8 first, and then write to std::ofstream)

    To illustrate the last step:

    #include <fstream>
    #include <locale>
    int main()
    {
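        // Use a UTF-8 locale so the wide stream encodes its output as UTF-8.
        std::locale::global(std::locale("en_US.utf8")); // any utf8 works
        std::wofstream f("test.txt");
        f.imbue(std::locale()); // a copy of the global locale set above

        // U+65E5, U+672C, U+8A9E: the three characters of "日本語"
        f << wchar_t(0x65e5) << wchar_t(0x672c) << wchar_t(0x8a9e) << '\n';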
    }
    

    This produces a file (on Linux) that contains the bytes e6 97 a5 e6 9c ac e8 aa 9e 0a, which is the UTF-8 encoding of 日本語 followed by a newline.