Tags: c++, string, unicode, unicode-string

std::string conversion to char32_t (Unicode characters)


I need to read a file that contains both ASCII and Unicode characters in C++, using fstream and the getline function.
However, getline only works with std::string, and the chars of such a string cannot be converted directly to char32_t, so I cannot compare them with Unicode characters. Could anyone suggest a fix?


Solution

  • char32_t corresponds to the UTF-32 encoding, which is rarely used for text files (and often poorly supported). Are you sure that your file is encoded in UTF-32?

    If you are sure, then you need to use std::u32string to store your string. For reading, you can use std::basic_stringstream<char32_t> for instance. However, please note that these types are generally poorly supported.

    Unicode is generally encoded using:

    • UTF-8 in text files (and web pages, etc.)

    • A platform-specific 16-bit or 32-bit encoding in programs, using type wchar_t

    So Unicode text files are generally encoded in UTF-8, which uses a variable number of bytes per character, from 1 (for ASCII characters) to 4. This means you cannot test individual characters by indexing a std::string directly: a single char may hold only one byte of a multi-byte character.
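
To make the variable-width point concrete, here is a minimal sketch that counts Unicode code points in a UTF-8 std::string by skipping continuation bytes (bytes of the form 10xxxxxx). The name count_code_points is a hypothetical helper, not part of the standard library, and the input is assumed to be valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard function): counts code points in a
// UTF-8 string. Assumes the input is valid UTF-8. Every byte that is NOT a
// continuation byte (10xxxxxx) starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the UTF-8 string "a\xE4\xBB\xAE" (the letter a followed by 仮, U+4EEE), size() reports 4 bytes while count_code_points reports 2 characters, which is why comparing a single char against 仮 cannot work.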

    To test individual characters, convert the UTF-8 string to a wchar_t string, stored in a std::wstring.

    Use a converter (declared in <locale> and <codecvt>; note that both facilities are deprecated since C++17) defined like this:

    std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
    

    Then convert like this:

    std::wstring unicodeString = converter.from_bytes(utf8String);
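
Put together, the two lines above might be wrapped as follows. This is only a sketch: utf8_to_wide is a hypothetical name, the input is assumed to be UTF-8, and compilers in C++17 mode or later will typically emit deprecation warnings for these facilities.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper around the converter shown above.
// Assumes `utf8` holds UTF-8 encoded text.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    // from_bytes throws std::range_error if the input cannot be converted.
    return converter.from_bytes(utf8);
}
```

Keep in mind that wchar_t is only 16 bits wide on some platforms (Windows in particular), so characters outside the Basic Multilingual Plane may not be representable this way.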
    

    You can then access the individual Unicode characters. Don't forget to prefix each character or string literal with L to make it a wide (wchar_t) literal. For instance:

    // L'仮' is a wide character literal; info() stands for your own logging function
    if (unicodeString[i] == L'仮')
    {
        info("this is a Japanese character");
    }