c++, arrays, unicode, utf-8, codepoint

Unicode to CodePoint C++


How can I get the code point from a Unicode value? According to the character code table, the code point for the pictogram '丂' is 8140, and its Unicode value is \u4E02.

I wrote this program in C++ to try to get the code point for a Unicode string value:

#include <iostream>
#include <atlstr.h>
#include <iomanip>
#include <codecvt>
#include <locale>
#include <string>

void hex_print(const std::string& s);

int main()
{
    std::wstring test = L"丂"; //assign pictogram directly
    std::wstring test2 = L"\u4E02"; //assign value via Unicode

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
    std::string u8str = conv1.to_bytes(test);
    hex_print(u8str);

    std::wstring_convert<std::codecvt_utf16<wchar_t>> conv2;
    std::string u8str2 = conv2.to_bytes(test2);
    hex_print(u8str2);

    return 0;

}

void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for (unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

Output:

00 81 00 40
4e 02

What can I do to get 00 81 00 40 when the value is \u4E02?


Solution

  • On Windows you can use WideCharToMultiByte with code page 54936 (GB18030):

    // Same headers as the question's program, plus <Windows.h> for
    // WideCharToMultiByte; hex_print is the function defined in the question.
    #include <Windows.h>
    #include <iostream>
    #include <iomanip>
    #include <codecvt>
    #include <locale>
    #include <string>
    
    void hex_print(const std::string& s);
    
    int main()
    {
        std::wstring test = L"丂"; //assign pictogram directly
        std::wstring test2 = L"\u4E02"; //assign value via Unicode
    
        std::wstring_convert<std::codecvt_utf16<wchar_t>> conv1;
        std::string u8str = conv1.to_bytes(test);
        hex_print(u8str);
    
        std::wstring_convert<std::codecvt_utf16<wchar_t>> conv2;
        std::string u8str2 = conv2.to_bytes(test2);
        hex_print(u8str2);
    
        // First call with a null output buffer returns the required size;
        // code page 54936 is GB18030, and -1 means the input is
        // null-terminated, so the returned length includes the final NUL.
        int len = WideCharToMultiByte(54936, 0, test2.c_str(), -1, NULL, 0, NULL, NULL);
        char* strGB18030 = new char[len + 1];
        // Second call performs the actual conversion into the buffer.
        WideCharToMultiByte(54936, 0, test2.c_str(), -1, strGB18030, len, NULL, NULL);
        hex_print(std::string(strGB18030));
        delete[] strGB18030;
    
        return 0;
    }
    

    Output:

    4e 02
    4e 02
    81 40
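
  • Outside of Windows there is no WideCharToMultiByte, but the POSIX iconv API can usually perform the same conversion to GB18030. The following is only a sketch, not part of the original answer: it assumes your C library's iconv recognizes the "UTF-8" and "GB18030" encoding names (glibc and GNU libiconv do), and that iconv's second parameter is char** rather than const char** (this varies by platform). The expected result is the same 81 40 byte pair from the character code table.

    #include <iconv.h>
    #include <cstdio>
    
    int main()
    {
        // Convert the UTF-8 encoding of U+4E02 ("丂") to GB18030.
        iconv_t cd = iconv_open("GB18030", "UTF-8");
        if (cd == (iconv_t)-1) {
            std::perror("iconv_open");
            return 1;
        }
    
        char in[] = "\xE4\xB8\x82";     // UTF-8 bytes of U+4E02
        char out[16] = {};
        char* inp = in;
        char* outp = out;
        size_t inleft = sizeof(in) - 1; // exclude the terminating NUL
        size_t outleft = sizeof(out);
    
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            std::perror("iconv");
            iconv_close(cd);
            return 1;
        }
        iconv_close(cd);
    
        // Print the converted bytes; expected output: 81 40
        for (char* p = out; p != outp; ++p)
            std::printf("%02x ", static_cast<unsigned char>(*p));
        std::printf("\n");
        return 0;
    }

    Depending on the platform, iconv may live in libc (glibc) or require linking with -liconv (GNU libiconv).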