c# compact-framework character-encoding internationalization utf-16

Advice on marshalled string that can be either ASCII or UTF-16


Welcome to unsafe land.

I'm using P/Invoke to call a legacy library that gives me a 0-terminated C-style string as an unmanaged byte buffer of unknown length. The string can be either ASCII or UTF-16, and the library gives no indication of which, other than the byte stream itself.

Right now I have a bad scheme, based on checking for single and double 0-bytes, to decide whether I should create a managed String from Char* or SByte*. The scheme obviously breaks down for UTF-16 text containing code points above U+00FF, because their encoding need not contain any 0-byte at all.
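To make the failure mode concrete, here is a minimal sketch that counts the 0-bytes in the UTF-16 encoding of a string. It is not the actual scheme, only an illustration; the class and method names are hypothetical, and it assumes little-endian UTF-16 (`Encoding.Unicode`).

```csharp
using System;
using System.Text;

static class ZeroByteDemo
{
    // Hypothetical helper: counts the zero bytes in the UTF-16LE encoding
    // of the given text (the terminating zero unit is not included).
    public static int CountZeroBytes(string text)
    {
        byte[] bytes = Encoding.Unicode.GetBytes(text);
        int zeros = 0;
        foreach (byte b in bytes)
        {
            if (b == 0) zeros++;
        }
        return zeros;
    }
}
```

For "Hi" the encoded bytes are 48 00 69 00, so every second byte is zero and a 0-byte check has something to work with. For "\u6C34\u706B" the bytes are 34 6C 6B 70, with no 0-byte before the terminator, so the same check sees nothing that distinguishes it from ASCII.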

This is what I have:

  • The address of the unmanaged byte buffer.
  • The unmanaged byte buffer is of unknown length.
  • The unmanaged byte buffer is either a 0-terminated ASCII C-style string or a 0-terminated UTF-16 C-style string.

This is what I want:

  • Create a correct managed String from the unmanaged byte buffer, whether it's ASCII or UTF-16.

Is this problem solvable in general?


Solution

  • I don't think this can be solved 100%. If the buffer contains 6c 34 00 00 ("l4"), is that the Chinese character for water (U+6C34, read as UTF-16BE), or just an ASCII lowercase L and a 4? But it should be possible to guess right "most of the time", depending on the specific strings.

    Is the UTF-16 little endian or (probably) big endian?

    The largest risk is buffer overrun. For instance, if the buffer starts with a 00, is that a zero-length ASCII string, or should we try reading more of the buffer, interpreting it as UTF-16BE?
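The ambiguity and the overrun concern can both be seen in a small sketch. It decodes the same unmanaged bytes as ASCII and as UTF-16BE and never reads past a caller-supplied limit. The helper name and the `maxBytes` parameter are assumptions made for illustration: the legacy library in the question provides no such limit, so the caller would need one from elsewhere. For little-endian data, `Encoding.Unicode` would replace `Encoding.BigEndianUnicode`.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class AmbiguousString
{
    // Hypothetical helper. maxBytes is an assumed, caller-supplied cap on how
    // many bytes may be read; nothing in the buffer itself provides it.
    public static void DecodeBothWays(IntPtr buffer, int maxBytes,
                                      out string asAscii, out string asUtf16Be)
    {
        // ASCII candidate: stop at the first zero byte or at the cap.
        int asciiLen = 0;
        while (asciiLen < maxBytes && Marshal.ReadByte(buffer, asciiLen) != 0)
        {
            asciiLen++;
        }

        // UTF-16 candidate: stop at the first zero 16-bit unit or at the cap.
        int utf16Len = 0;
        while (utf16Len + 1 < maxBytes &&
               (Marshal.ReadByte(buffer, utf16Len) != 0 ||
                Marshal.ReadByte(buffer, utf16Len + 1) != 0))
        {
            utf16Len += 2;
        }

        // Copy only as many bytes as the longer candidate needs.
        byte[] bytes = new byte[Math.Max(asciiLen, utf16Len)];
        Marshal.Copy(buffer, bytes, 0, bytes.Length);

        asAscii = Encoding.ASCII.GetString(bytes, 0, asciiLen);
        asUtf16Be = Encoding.BigEndianUnicode.GetString(bytes, 0, utf16Len);
    }
}
```

Given the four bytes 6c 34 00 00, this yields "l4" as the ASCII candidate and the single character U+6C34 as the UTF-16BE candidate. Choosing between the two still depends on what the strings are expected to contain; the sketch only produces both readings without running off the end of the buffer.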