Tags: c++, c, utf-8, character-encoding, iso-8859-1

How to convert (char *) from ISO-8859-1 to UTF-8 in C++ in a cross-platform way?


I'm changing a piece of software written in C++, which processes text in ISO Latin 1 format, so that it stores its data in an SQLite database.
The problem is that SQLite works in UTF-8, and the Java modules that use the same database also work in UTF-8.

I wanted to have a way to convert the ISO Latin 1 characters to UTF-8 characters before storing in the database. I need it to work in Windows and Mac.

I heard ICU would do that, but I think it's too bloated. I just need a simple conversion routine (preferably in both directions) for these two charsets.

How would I do that?


Solution

  • ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

    For each input byte:

    uint8_t ch = code_point; /* Latin-1 is 8-bit, so code points above 0xff cannot occur */

    if (ch < 0x80) {
        append(ch);                 /* ASCII range: copied unchanged */
    } else {
        append(0xc0 | (ch >> 6));   /* lead byte: always 0xc2 or 0xc3 for this range */
        append(0x80 | (ch & 0x3f)); /* continuation byte: low six bits */
    }
    

    See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

    EDIT: according to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
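    Since the question asks for a conversion in both directions, here is a minimal sketch of how the loop above might look as self-contained C++. The function names latin1_to_utf8 and utf8_to_latin1 are made up for this example, and throwing std::invalid_argument on input that does not fit in Latin-1 is just one possible policy (substituting '?' would be another). It uses only the standard library, so it should build the same way on Windows and Mac.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode ISO-8859-1 bytes as UTF-8. Each Latin-1 byte value equals its
// Unicode code point, so every byte becomes one or two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte expands to two
    for (char c : in) {
        // Go through unsigned char first: plain char may be signed.
        const std::uint8_t ch = static_cast<unsigned char>(c);
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));                 // ASCII, unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (ch >> 6)));   // lead byte: 0xC2 or 0xC3
            out.push_back(static_cast<char>(0x80 | (ch & 0x3F))); // continuation byte
        }
    }
    return out;
}

// Decode UTF-8 back to ISO-8859-1. Throws std::invalid_argument when the
// input is malformed or contains a code point above U+00FF.
std::string utf8_to_latin1(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::uint8_t b0 = static_cast<unsigned char>(in[i]);
        if (b0 < 0x80) {
            out.push_back(static_cast<char>(b0));
        } else if ((b0 == 0xC2 || b0 == 0xC3) && i + 1 < in.size()) {
            // Only lead bytes 0xC2 and 0xC3 encode U+0080..U+00FF.
            const std::uint8_t b1 = static_cast<unsigned char>(in[++i]);
            if ((b1 & 0xC0) != 0x80) {
                throw std::invalid_argument("malformed UTF-8 continuation byte");
            }
            out.push_back(static_cast<char>(((b0 & 0x03) << 6) | (b1 & 0x3F)));
        } else {
            throw std::invalid_argument("input not representable in ISO-8859-1");
        }
    }
    return out;
}
```

    With this sketch, the Latin-1 byte 0xE9 (é) encodes to the two bytes 0xC3 0xA9, and decoding those two bytes gives 0xE9 back. For a raw char * buffer, constructing a std::string from the pointer and length before calling the function avoids depending on a terminating NUL.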