Welcome to unsafe land.
I'm doing P/Invoke to a legacy lib that gives me a 0-terminated C-style string as an unmanaged byte buffer of unknown length. The string can be either ASCII or UTF-16, but the lib gives no indication whatsoever which - other than the byte stream itself, that is...
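For concreteness, the import is shaped something like this (the library name and entry point here are made up):

    using System;
    using System.Runtime.InteropServices;

    static class Native
    {
        // Hypothetical import: the real lib just returns a pointer to a
        // 0-terminated buffer and attaches no encoding information to it.
        [DllImport("legacy.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern IntPtr GetLegacyString();
    }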
Right now I have a bad scheme, based on checking for single and double 0-bytes, to decide whether to create a managed String from a Char* or an SByte*. The scheme obviously breaks down for every Unicode code point higher than U+00FF.
This is what I have:
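In simplified form (DetectAndConvert is a stand-in name for my helper, and error handling is omitted):

    static unsafe string DetectAndConvert(IntPtr buffer)
    {
        byte* p = (byte*)buffer;

        // Crude check: a non-zero byte followed by a 0-byte looks like a
        // UTF-16LE code unit in the Latin-1 range; otherwise assume ASCII.
        if (p[0] != 0 && p[1] == 0)
            return new string((char*)buffer);   // UTF-16: read via Char*
        return new string((sbyte*)buffer);      // ASCII: read via SByte*
    }

This breaks as soon as the first character is above U+00FF, because then both bytes of the UTF-16 code unit are non-zero.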
This is what I want: a managed String from the unmanaged byte buffer, whether it's ASCII or UTF-16. Is that problem generically solvable?
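In other words, something with roughly this shape (ReadLegacyString is just an illustrative name):

    // Desired: one call that yields the right String no matter which
    // encoding the legacy lib happened to use.
    static string ReadLegacyString(IntPtr buffer)
    {
        throw new NotImplementedException("detect ASCII vs UTF-16, then decode");
    }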
I don't think this can be solved 100%. If the buffer contains 6c 34 00 00 ("l4"), is that the Chinese character for water (U+6C34), or just an ASCII lowercase L and a 4? But it should be possible to guess right "most of the time", depending on the specific strings.
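The ambiguity is easy to demonstrate by decoding the same bytes both ways:

    using System.Text;

    byte[] buf = { 0x6C, 0x34, 0x00, 0x00 };

    // As ASCII: "l4", then a terminator (plus one stray 0).
    string asAscii = Encoding.ASCII.GetString(buf, 0, 2);

    // As UTF-16BE: the single code unit 0x6C34, the water character,
    // then a 00 00 terminator.
    string asUtf16be = Encoding.BigEndianUnicode.GetString(buf, 0, 2);

Both are perfectly valid interpretations; nothing in the bytes themselves says which one was meant.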
Is the UTF-16 little-endian or (probably) big-endian?
The largest risk is buffer overrun. For instance, if the buffer starts with a 00, is that a zero-length ASCII string, or should we try reading more of the buffer, interpreting it as UTF-16BE?
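A "most of the time" guesser could look at where the first 0-bytes appear, as sketched below - but note this only helps for mostly-Latin text, it assumes you can learn at least an upper bound on the buffer size from the API (otherwise any scan can overrun), and it still cannot resolve cases like the leading-00 one above:

    enum Guess { Ascii, Utf16LE, Utf16BE }

    // Heuristic sketch: Latin-heavy UTF-16 text has a 0 in every other
    // byte (odd offsets for LE, even offsets for BE), while ASCII has no
    // 0 before the terminator. Only the first two bytes are inspected
    // here, so text starting above U+00FF is still misread as ASCII.
    static unsafe Guess GuessEncoding(byte* p, int maxBytes)
    {
        if (maxBytes >= 2 && p[0] == 0 && p[1] != 0)
            return Guess.Utf16BE;   // 00 xx: high byte first
        if (maxBytes >= 2 && p[0] != 0 && p[1] == 0)
            return Guess.Utf16LE;   // xx 00: low byte first
        return Guess.Ascii;         // xx xx, or empty/too short: assume ASCII
    }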