Search code examples
c++linuxicuwchar-t

Convert UTF-32 wide char to UTF-16 wide char in Linux for Supplementary Plane characters


We have a C++ application deployed on RHEL using ICU.

We have a situation where in we need to convert UChar* to wchar_t* on linux. We use u_strToWCS to perform the conversion.

#include <iostream>
#include <wchar.h>

#include "unicode/ustring.h"

void convertUnicodeStringtoWideChar(const UChar* cuniszSource,
                                    const int32_t cunii32SourceLength,
                                    wchar_t*& rpwcharDestination,
                                    int32_t& destCapacity)
{
  UErrorCode uniUErrorCode = U_ZERO_ERROR;

  int32_t pDestLength = 0;

  rpwcharDestination     = 0;
  destCapacity = 0;

  u_strToWCS(rpwcharDestination,
             destCapacity,
             &pDestLength,
             cuniszSource,
             cunii32SourceLength,
             &uniUErrorCode);

  uniUErrorCode = U_ZERO_ERROR;
  rpwcharDestination = new wchar_t[pDestLength+1];
  if(rpwcharDestination)
  {
    destCapacity = pDestLength+1;

    u_strToWCS(rpwcharDestination,
               destCapacity,
               &pDestLength,
               cuniszSource,
               cunii32SourceLength,
               &uniUErrorCode);

    destCapacity = wcslen(rpwcharDestination);
  }
} //function ends

int main()
{
    //                     a       ä       Š       €    (     王     )
    UChar input[20] = { 0x0061, 0x00e4, 0x0160, 0x20ac, 0xd87e, 0xdd29, 0x0000 };
    wchar_t * output;
    int32_t outlen = 0;
    convertUnicodeStringtoWideChar( input, 6, output, outlen );
    for ( int i = 0; i < outlen; ++i )
    {
        std::cout << std::hex << output[i] << "\n";
    }
    return 0;
}

This works fine for characters entered upto 65535 (as UChar is implemented as uint16_t internally on linux). It fails to convert characters outside Basic Multilingual Plane (eg CJK Unified Ideographs Extension B)

Any ideas on how to perform the conversion?

Update 1: OK. I was looking at wrong directions. u_strToWCS works fine. The problem arises because I need to pass that wide string to a java application on windows using CORBA. Since wchar_t in linux is 32bit, I need to find a way to convert 32bit wchar_t to 16bit wchar_t

Update 2: The code which I have used can be found here


Solution

  • The following is the code to convert UTF-32 encoded wide characters to UTF-16

    //Function to convert a Unicode string from platform-specific "wide characters" (wchar_t) to UTF-16.
    void ConvertUTF32ToUTF16(wchar_t* source,
                             const uint32_t sourceLength,
                             wchar_t*& destination,
                             uint32_t& destinationLength)
    {
    
      wchar_t wcharCharacter;
      uint32_t uniui32Counter = 0;
    
      wchar_t* pwszDestinationStart = destination;
      wchar_t* sourceStart = source;
    
      if(0 != destination)
      {
        while(uniui32Counter < sourceLength)
        {
          wcharCharacter = *source++;
          if(wcharCharacter <= 0x0000FFFF)
          {
            /* UTF-16 surrogate values are illegal in UTF-32
               0xFFFF or 0xFFFE are both reserved values */
            if(wcharCharacter >= 0xD800 && 
               wcharCharacter <= 0xDFFF)
            {
              *destination++ = 0x0000FFFD;
              destinationLength += 1;
            }
            else
            {
              /* source is a BMP Character */
              destinationLength += 1;
              *destination++ = wcharCharacter;
            }
          }
          else if(wcharCharacter > 0x0010FFFF)
          {
            /* U+10FFFF is the largest code point of Unicode Character Set */
            *destination++ = 0x0000FFFD;
            destinationLength += 1;
          }
          else
          {
            /* source is a character in range 0xFFFF - 0x10FFFF */
            wcharCharacter -= 0x0010000UL;
            *destination++ = (wchar_t)((wcharCharacter >> 10) + 0xD800);
            *destination++ = (wchar_t)((wcharCharacter & 0x3FFUL) + 0xDC00);
            destinationLength += 2;
          }
    
          ++uniui32Counter;
        }
    
        destination = pwszDestinationStart;
        destination[destinationLength] = '\0';
      }
    
      source = sourceStart;
    } //function ends