Tags: c#, .net, unicode, com, com-interop

COM methods, Char type and CharSet


This is a follow-up to my previous question: Does .NET interop copy array data back and forth, or does it pin the array?

My method is a COM interface method (rather than a DllImport method). The C# signature looks like this:

void Next(ref int pcch,
    [In, Out, MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 0)]
    char [] pchText);

MSDN says:

When a managed Char type, which has Unicode formatting by default, is passed to unmanaged code, the interop marshaler converts the character set to ANSI. You can apply the DllImportAttribute attribute to platform invoke declarations and the StructLayoutAttribute attribute to a COM interop declaration to control which character set a marshaled Char type uses.

Also, @HansPassant in his answer here says:

A char[] can't be marshaled as LPWStr, it has to be LPArray. Now the CharSet attribute plays a role, since you did not specify it, the char[] will be marshaled as an 8-bit char[], not a 16-bit wchar_t[]. The marshaled array element is not the same size (it is not "blittable") so the marshaller must copy the array.

Pretty undesirable, particularly given that your C++ code expects wchar_t. A very easy way to tell in this specific case is not getting anything back in the array. If the array is marshaled by copying then you have to tell the marshaller explicitly that the array needs to be copied back after the call. You'd have to apply the [In, Out] attribute on the argument. You'll get Chinese.

I couldn't find an analog of CharSet (normally used with DllImportAttribute and StructLayoutAttribute) that could be applied to a COM interface method.

Nevertheless, I don't get "Chinese" on the output. Everything seems to work fine, I do get correct Unicode characters back from COM.

Does it mean Char is always interpreted as WCHAR for COM method interop?

I couldn't find any documentation confirming or denying this.


Solution

  • I think this is a good question, and the char (System.Char) interop behavior does deserve some attention.

    In managed code, sizeof(char) always equals 2 (two bytes), because in .NET characters are always Unicode (UTF-16).

    Nevertheless, the marshaling rules for char differ between P/Invoke (calling an exported DLL API) and COM interop (calling a COM interface method).

    For P/Invoke, CharSet can be used explicitly with any [DllImport] attribute, or implicitly via [module|assembly: DefaultCharSet(CharSet.Auto|Ansi|Unicode)] to change the default setting for all [DllImport] declarations per module or per assembly.

    The default value is CharSet.Ansi, which means there will be a Unicode-to-ANSI conversion. I usually change the default to Unicode with [module: DefaultCharSet(CharSet.Unicode)], and then selectively use [DllImport(CharSet = CharSet.Ansi)] in those rare cases where I need to call an ANSI API.
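    As a sketch of this pattern (DefaultCharSet and DllImport are the real System.Runtime.InteropServices attributes; the MessageBoxW/MessageBoxA declarations are just familiar Win32 examples chosen for illustration):

    ```csharp
    using System;
    using System.Runtime.InteropServices;

    // Make Unicode the default for every [DllImport] in this module...
    [module: DefaultCharSet(CharSet.Unicode)]

    static class NativeMethods
    {
        // ...so this declaration marshals char/string data as UTF-16 (wchar_t)...
        [DllImport("user32.dll", ExactSpelling = true)]
        internal static extern int MessageBoxW(
            IntPtr hWnd, string text, string caption, uint type);

        // ...while this one explicitly opts back into ANSI marshaling.
        [DllImport("user32.dll", CharSet = CharSet.Ansi, ExactSpelling = true)]
        internal static extern int MessageBoxA(
            IntPtr hWnd, string text, string caption, uint type);
    }
    ```

    With the module-level default in place, only the exceptional ANSI declarations need an explicit CharSet.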

    It is also possible to alter any specific char-typed parameter with MarshalAs(UnmanagedType.U1|U2) or MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.U1|U2) (for a char[] parameter). E.g., you may have something like this:

    [DllImport("Test.dll", ExactSpelling = true, CharSet = CharSet.Unicode)]
    static extern bool TestApi(
        int length,
        [In, Out, MarshalAs(UnmanagedType.LPArray)] char[] buff1,
        [In, Out, MarshalAs(UnmanagedType.LPArray,
            ArraySubType = UnmanagedType.U1)] char[] buff2); 
    

    In this case, buff1 will be passed as an array of double-byte values (as is), but buff2 will be converted to and from an array of single-byte values. Note that this is still a smart, Unicode-to-OS-current-code-page (and back) conversion for buff2. E.g., a Unicode '\x20AC' ('€') will become \x80 in the unmanaged code (provided the OS code page is Windows-1252). This is how the marshaling of [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.U1)] char[] buff differs from [MarshalAs(UnmanagedType.LPArray, ArraySubType = UnmanagedType.U1)] ushort[] buff. For ushort, 0x20AC would simply be truncated to 0xAC.
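    The difference between the two conversions can be observed without any native code, using Encoding to mimic what the marshaler does (a sketch; on .NET Core / .NET 5+, the Windows-1252 code page assumes the System.Text.Encoding.CodePages package and provider registration, while on .NET Framework GetEncoding(1252) works out of the box):

    ```csharp
    using System;
    using System.Text;

    class CharSetDemo
    {
        static void Main()
        {
            // Needed on .NET Core / .NET 5+ so legacy code pages are available.
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
            Encoding cp1252 = Encoding.GetEncoding(1252);

            // char[] with ArraySubType = U1: a code-page-aware conversion,
            // so U+20AC (the euro sign) maps to 0x80 in Windows-1252.
            byte[] viaCodePage = cp1252.GetBytes(new[] { '\x20AC' });
            Console.WriteLine(viaCodePage[0].ToString("X2")); // 80

            // ushort[] with ArraySubType = U1: a plain numeric narrowing,
            // so 0x20AC keeps only its low byte, 0xAC.
            ushort[] raw = { 0x20AC };
            Console.WriteLine(((byte)raw[0]).ToString("X2")); // AC
        }
    }
    ```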

    For calling a COM interface method, the story is quite different. There, char is always treated as a double-byte value representing a Unicode character. Perhaps the reason for such a design decision can be inferred from Don Box's "Essential COM" (quoting the footnote from this page):

    The OLECHAR type was chosen in favor of the common TCHAR data type used by the Win32 API to alleviate the need to support two versions of each interface (CHAR and WCHAR). By supporting only one character type, object developers are decoupled from the state of the UNICODE preprocessor symbol used by their clients.

    Apparently, the same concept made its way into .NET. I'm pretty confident this is true even on legacy ANSI platforms (like Windows 95, where Marshal.SystemDefaultCharSize == 1).

    Note that DefaultCharSet doesn't have any effect on char when it is part of a COM interface method signature. Nor is there a way to apply CharSet explicitly. However, you still have full control over the marshaling behavior of each individual parameter with MarshalAs, in exactly the same way as for P/Invoke above. E.g., your Next method might look like below, in case the unmanaged COM code expects a buffer of ANSI characters:

    void Next(ref int pcch,
        [In, Out, MarshalAs(UnmanagedType.LPArray, 
            ArraySubType = UnmanagedType.U1, SizeParamIndex = 0)] char [] pchText);