I'm working with strings that may contain surrogate unicode characters (non-BMP characters, which take two UTF-16 code units, i.e. 4 bytes, each).
When I use the "\Uxxxxxxxx" format to specify a surrogate character in F#, for some characters it gives a different result than in C#. For example:
C#:
using System;

string s = "\U0001D11E";
bool c = Char.IsSurrogate(s, 0);
Console.WriteLine(String.Format("Length: {0}, is surrogate: {1}", s.Length, c));
Gives: Length: 2, is surrogate: True
F#:
open System

let s = "\U0001D11E"
let c = Char.IsSurrogate(s, 0)
printf "Length: %d, is surrogate: %b" s.Length c
Gives: Length: 2, is surrogate: false
Note: some surrogate characters work in F# ("\U0010011", "\U00100011"), but some of them don't.
Q: Is this a bug in F#? How can I handle surrogate unicode characters in strings with F#? Does F# have a different format, or is the only way to use Char.ConvertFromUtf32 0x1D11E?
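For reference, here is a minimal sketch of the Char.ConvertFromUtf32 workaround mentioned above; it builds the string at run time instead of relying on the literal:

open System

// ConvertFromUtf32 always produces the correct surrogate pair,
// bypassing the compiler's parsing of the literal entirely.
let s = Char.ConvertFromUtf32 0x1D11E
let c = Char.IsSurrogate(s, 0)
printf "Length: %d, is surrogate: %b" s.Length c
// Gives: Length: 2, is surrogate: true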
Update:
s.ToCharArray() gives [| 0xD800; 0xDF41 |] for F#, and { 0xD834, 0xDD1E } for C#.
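(This is how I dumped the code units; dumpUnits is just a small helper sketch:)

// Print the UTF-16 code units of a string as hex values.
let dumpUnits (s: string) =
    s.ToCharArray() |> Array.iter (fun ch -> printf "0x%04X " (int ch))
    printfn ""

dumpUnits "\U0001D11E"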
That obviously means that F# makes a mistake while parsing some string literals. This is confirmed by the fact that the character you mentioned is non-BMP, so in UTF-16 it must be represented as a surrogate pair: two 16-bit code units in the range 0xD800-0xDFFF that together encode the character. The pair in the string F# produced is not the surrogate pair for that character.
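To make that concrete, here is a minimal sketch of the standard UTF-16 encoding algorithm (toSurrogatePair is an illustrative helper, not a library function); for 0x1D11E it yields exactly the pair C# produces:

// Standard UTF-16 encoding of a code point above 0xFFFF:
// subtract 0x10000, then split the remaining 20 bits in half.
let toSurrogatePair (codePoint: int) =
    let v = codePoint - 0x10000
    let high = 0xD800 + (v >>> 10)    // top 10 bits
    let low  = 0xDC00 + (v &&& 0x3FF) // bottom 10 bits
    high, low

// toSurrogatePair 0x1D11E evaluates to (0xD834, 0xDD1E).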
But the processing of surrogates doesn't change, as the framework under the hood is the same. So you already have the answer in your question: if you need string literals with non-BMP characters in your code, just use Char.ConvertFromUtf32 instead of the \UXXXXXXXX notation. All the rest of the processing will be the same as always.
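For example, to embed such a character in a longer string (gClef is just an illustrative name):

// Build the non-BMP character at run time and concatenate it as usual.
let gClef = System.Char.ConvertFromUtf32 0x1D11E  // U+1D11E, MUSICAL SYMBOL G CLEF
let line = "The G clef looks like this: " + gClef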