Tags: c++, utf-8, character-encoding, shift-jis, double-byte

C++ ShiftJIS to UTF8 conversion


I need to convert double-byte characters, in my case Shift-JIS, into something easier to handle, preferably using only standard C++.

The following question ended without a workaround: "Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized".

Does anyone have a suggestion or a reference on how to handle this conversion with standard C++?


Solution

  • Normally I would recommend the ICU library, but pulling it in for this conversion alone is far too much overhead.

    First, a conversion function that takes a std::string containing Shift-JIS data and returns a std::string containing UTF-8 (note from 2019: I no longer know whether it works :))

    It uses a uint8_t array of 25088 elements (25088 bytes), referred to as convTable in the code. The function does not fill this array; you have to load it first, e.g. from a file. The second code block below is a program that can generate that file.

    The conversion function does not check whether the input is valid Shift-JIS data.

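    #include <cstdint>
    #include <string>

    // convTable is the 25088-byte lookup table described above. It is not
    // defined here and has to be loaded before the first call.
    std::string sj2utf8(const std::string &input)
    {
        std::string output(3 * input.length(), ' '); //Shift-JIS never needs 4-byte UTF-8, so at most 3 bytes per input byte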
        size_t indexInput = 0, indexOutput = 0;
    
        while(indexInput < input.length())
        {
            char arraySection = ((uint8_t)input[indexInput]) >> 4;
    
            size_t arrayOffset;
            if(arraySection == 0x8) arrayOffset = 0x100; //these are two-byte shiftjis
            else if(arraySection == 0x9) arrayOffset = 0x1100;
            else if(arraySection == 0xE) arrayOffset = 0x2100;
            else arrayOffset = 0; //this is one byte shiftjis
    
            //determining real array offset
            if(arrayOffset)
            {
                arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
                indexInput++;
                if(indexInput >= input.length()) break;
            }
            arrayOffset += (uint8_t)input[indexInput++];
            arrayOffset <<= 1;
    
            //unicode number is...
            uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];
    
            //converting to UTF8
            if(unicodeValue < 0x80)
            {
                output[indexOutput++] = unicodeValue;
            }
            else if(unicodeValue < 0x800)
            {
                output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
                output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
            }
            else
            {
                output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
                output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
                output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
            }
        }
    
        output.resize(indexOutput); //remove the unnecessary bytes
        return output;
    }
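
Since the function above does not validate its input, a structural pre-check can be run first. The following is only a sketch: the helper names are hypothetical, and the byte ranges are the commonly documented ones for JIS X 0208 based Shift-JIS (lead bytes 0x81-0x9F and 0xE0-0xEF, trail bytes 0x40-0x7E and 0x80-0xFC, single bytes 0x00-0x7F and 0xA1-0xDF). It deliberately rejects the vendor extension lead bytes 0xF0-0xFC because the table used here has no entries for them, and it does not check whether a well-formed byte pair is actually assigned a character.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers, not part of the original answer.
bool isSjisLeadByte(uint8_t b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
}

bool isSjisTrailByte(uint8_t b)
{
    return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
}

// Returns true if the byte sequence is structurally plausible Shift-JIS.
bool looksLikeShiftJis(const std::string &input)
{
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const uint8_t b = static_cast<uint8_t>(input[i]);
        if (isSjisLeadByte(b))
        {
            if (i + 1 >= input.size()) return false; // truncated two-byte character
            if (!isSjisTrailByte(static_cast<uint8_t>(input[i + 1]))) return false;
            ++i; // skip the trail byte
        }
        else if (b == 0x80 || b == 0xA0 || b >= 0xF0)
        {
            return false; // neither a single-byte character nor a supported lead byte
        }
    }
    return true;
}
```

A caller could, for example, run `looksLikeShiftJis(data)` first and only call `sj2utf8(data)` when it returns true.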
    

    About the helper file: I used to offer a download here, but these days I only know unreliable file hosts. So either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:

    First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT . It is too long to paste here, so we have to hope that at least unicode.org stays online.

    Then run this program with the text file above redirected to stdin and with stdout redirected to a new binary file. (This needs a binary-safe shell; I don't know whether it works on Windows.)

    #include <iostream>
    #include <string>
    #include <cstdint>
    #include <cstdio>
    
    using namespace std;
    
    // pipe SHIFTJIS.txt in and pipe to (binary) file out
    int main()
    {
        string s;
        uint8_t *mapping; //same bigendian array as in converting function
        mapping = new uint8_t[2*(256 + 3*256*16)];
    
        //initializing with space for invalid value, and then ASCII control chars
        for(size_t i = 32; i < 256 + 3*256*16; i++)
        {
            mapping[2 * i] = 0;
            mapping[2 * i + 1] = 0x20;
        }
        for(size_t i = 0; i < 32; i++)
        {
            mapping[2 * i] = 0;
            mapping[2 * i + 1] = i;
        }
    
        while(getline(cin, s)) //pipe the file SHIFTJIS to stdin
        {
            if(s.substr(0, 2) != "0x") continue; //comment lines
    
            uint16_t shiftJisValue, unicodeValue;
            if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) //getting hex values
            {
                fputs("Error hex reading\n", stderr); //stdout carries the binary table
                continue;
            }
    
            size_t offset; //array offset
            if((shiftJisValue >> 8) == 0) offset = 0;
            else if((shiftJisValue >> 12) == 0x8) offset = 256;
            else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
            else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
            else
            {
                fputs("Error input values\n", stderr); //stdout carries the binary table
                continue;
            }
    
            offset = 2 * (offset + (shiftJisValue & 0xfff));
            if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
            {
                fputs("Error mapping not 1:1\n", stderr); //stdout carries the binary table
                continue;
            }
    
            mapping[offset] = unicodeValue >> 8;
            mapping[offset + 1] = unicodeValue & 0xff;
        }
    
        fwrite(mapping, 1, 2*(256 + 3*256*16), stdout);
        delete[] mapping;
        return 0;
    }
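
Once the generator has written its output file, the table still has to be read back before the conversion function can use it. A minimal sketch of such a loader follows; the function name, the constant name and the choice of std::vector are assumptions for illustration and not part of the original answer. The file is opened in binary mode so that no newline translation takes place.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Same size as the array written by the generator: 25088 bytes.
constexpr std::size_t kConvTableSize = 2 * (256 + 3 * 256 * 16);

// Hypothetical loader: returns the table, or an empty vector if the file
// cannot be opened or is shorter than expected. Extra trailing bytes are ignored.
std::vector<uint8_t> loadConvTable(const std::string &path)
{
    std::vector<uint8_t> table(kConvTableSize);
    std::ifstream in(path, std::ios::binary);
    if (!in.read(reinterpret_cast<char *>(table.data()),
                 static_cast<std::streamsize>(table.size())))
    {
        table.clear(); // open failed or short read
    }
    return table;
}
```

The conversion function above reads from a variable named convTable, so the loaded bytes would either have to be copied into that array or the function adapted to take a pointer to the data.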
    

    Notes on the table layout:
    Each entry is a raw Unicode value stored as two bytes, big endian (more than two bytes are not necessary here).
    The first 256 entries (512 bytes) cover the single-byte Shift-JIS characters, with the value 0x20 for invalid ones.
    Then follow 3 * 256*16 entries for the groups 0x8???, 0x9??? and 0xE???.
    In total: 2 * (256 + 3 * 256*16) = 25088 bytes.
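
The layout described in these notes can be written down as a small index calculation. This is a sketch with hypothetical helper names, mirroring the offsets used by the generator and the conversion function. As a worked example, the Shift-JIS code 0x82A0 falls into the 0x8??? group, so its entry index is 256 + 0x2A0 = 928 and its byte offset is 2 * 928 = 1856.

```cpp
#include <cstddef>
#include <cstdint>

// Sentinel for codes outside the four blocks of the table.
constexpr std::size_t kNoEntry = static_cast<std::size_t>(-1);

// Hypothetical helper: byte offset of the table entry for a Shift-JIS code,
// given as 0x00..0xFF (single byte) or as lead byte << 8 | trail byte.
std::size_t tableByteOffset(uint16_t sjis)
{
    std::size_t entry;
    if ((sjis >> 8) == 0)         entry = sjis;                            // single-byte block
    else if ((sjis >> 12) == 0x8) entry = 256 + (sjis & 0xFFF);            // group 0x8???
    else if ((sjis >> 12) == 0x9) entry = 256 + 16 * 256 + (sjis & 0xFFF); // group 0x9???
    else if ((sjis >> 12) == 0xE) entry = 256 + 2 * 16 * 256 + (sjis & 0xFFF); // group 0xE???
    else return kNoEntry;
    return 2 * entry; // two bytes per entry
}

// Reads the big-endian Unicode value; 0x20 is the table's marker for "invalid".
uint16_t lookupUnicode(const uint8_t *table, uint16_t sjis)
{
    const std::size_t off = tableByteOffset(sjis);
    if (off == kNoEntry) return 0x20;
    return static_cast<uint16_t>((table[off] << 8) | table[off + 1]);
}
```

The last possible code, 0xEFFF, maps to byte offset 25086, which together with its second byte fills the table exactly to 25088 bytes.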