Tags: c++, unicode, utf-8, icu, ucs2

UTF-8 to UCS-2 conversion with the ICU library


I'm hitting an issue converting a UTF-8 string to a UCS-2 string with the ICU library. The library offers several ways to do this, but so far none of them seem to work. Given how widely used the library is, I assume I'm doing something wrong.

First, the common code. In all cases I create a string and pass it along on an object, but nothing manipulates it before it reaches the conversion steps.

The UTF-8 string currently being used is simply "ĩ".

For the sake of simplicity, the string being converted is represented as uniString in this code:

UErrorCode resultCode = U_ZERO_ERROR;

UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);

// Change the callback to error out instead of the default            
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);

int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];                       

printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
    outputLength ? target : "invalid_char", resultCode, outputLength);

if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");                    

        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }

        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
            outputLength ? target : "invalid_char", resultCode, outputLength);

        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}

The problem is that ucnv_fromAlgorithmic reports the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for UCS-2.

The other attempt used ucnv_convert, which you can see commented out. That call performed the conversion, but it did not fail on the ISO-8859-1 attempt as it should have.

So the question is: does anyone with experience of these functions see something incorrect in the code, or is my assumption about how this character should convert wrong?
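The assumption about the character itself can be checked without ICU. The sketch below hand-decodes the two UTF-8 bytes of "ĩ" (0xC4 0xA9) and tests the resulting code point against the Latin-1 and UCS-2 ranges. The helper names decodeTwoByteUtf8, fitsLatin1 and fitsUcs2 are hypothetical, made up for this illustration, and the decoder handles only two-byte sequences.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of ICU): decodes exactly one two-byte UTF-8
// sequence of the form 110xxxxx 10xxxxxx. Any other input yields 0xFFFFFFFF.
std::uint32_t decodeTwoByteUtf8(const std::string& s) {
    if (s.size() != 2) return 0xFFFFFFFFu;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    const unsigned char b1 = static_cast<unsigned char>(s[1]);
    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80) return 0xFFFFFFFFu;
    // Five payload bits from the lead byte, six from the trail byte.
    return (static_cast<std::uint32_t>(b0 & 0x1F) << 6) | (b1 & 0x3F);
}

// ISO-8859-1 covers only code points U+0000..U+00FF.
bool fitsLatin1(std::uint32_t cp) { return cp <= 0xFF; }

// UCS-2 covers the Basic Multilingual Plane, excluding the surrogate range.
bool fitsUcs2(std::uint32_t cp) {
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

Calling decodeTwoByteUtf8("\xC4\xA9") gives 0x129 (U+0129), for which fitsLatin1 is false and fitsUcs2 is true. So a failure on the ISO-8859-1 attempt is expected for this character, while a UCS-2 conversion should be able to represent it.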


Solution

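  • You need to reset resultCode to U_ZERO_ERROR before calling ucnv_open for the UCS-2 converter. Otherwise the U_INVALID_CHAR_FOUND left over from the ISO-8859-1 attempt makes ucnv_open return immediately without opening a converter, and that stale error is what you see reported. Quote from the manual: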

    "ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test if(U_FAILURE(errorCode)) { return immediately; } so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operation"