I'm working on a small side project that involves porting existing C# code to F#, and I seem to have come across a difference in how the two languages handle regular expressions. (I'm posting this hoping to find out that I'm just doing something wrong.)
This minor function simply detects surrogate pairs using the regular expression trick outlined here. Here's the current implementation:
let isSurrogatePair input =
    Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]")
If I then execute it against a known surrogate pair like this:
let result = isSurrogatePair "𠮷野𠮷"
printfn "%b" result
I get false in the FSI window.
If I use the equivalent C#:
public bool IsSurrogatePair(string input)
{
    return Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]");
}
And the same input value, I (correctly) get true back.
Is this a genuine issue, or am I simply doing something wrong in my F# implementation?
There appears to be a bug in how F# encodes escaped Unicode characters.
Here's the output from F# Interactive (note the last two results):
> "\uD500".[0] |> uint16 ;;
val it : uint16 = 54528us
> "\uD700".[0] |> uint16 ;;
val it : uint16 = 55040us
> "\uD800".[0] |> uint16 ;;
val it : uint16 = 65533us
> "\uD900".[0] |> uint16 ;;
val it : uint16 = 65533us
The value 65533 is 0xFFFD, the Unicode replacement character. The expected results for the last two lines are 55296 (0xD800) and 55552 (0xD900), so the surrogate code units are being replaced rather than preserved.
Fortunately, this workaround works:
> let s = new System.String( [| char 0xD800 |] )
s.[0] |> uint16
;;
val s : System.String = "�"
val it : uint16 = 55296us
Based on that finding, I can construct a corrected (or rather, worked-around) version of isSurrogatePair:
let isSurrogatePair input =
    let chrToStr code = new System.String( [| char code |] )
    let regex = "[" + (chrToStr 0xD800) + "-" + (chrToStr 0xDBFF) + "][" + (chrToStr 0xDC00) + "-" + (chrToStr 0xDFFF) + "]"
    Regex.IsMatch(input, regex)
This version correctly returns true for your input.
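Two other ways to sidestep the literal-encoding problem might also be worth trying. The sketch below is not part of the original workaround, and the function names isSurrogatePairVerbatim and containsSurrogatePair are made up for illustration. The first variant assumes a verbatim string is acceptable, so the F# compiler never decodes the \uXXXX escapes and the .NET regex engine interprets them instead. The second drops the regex and relies on System.Char.IsSurrogatePair.

```fsharp
open System
open System.Text.RegularExpressions

// Hypothetical name. The verbatim string (@"...") keeps the backslashes as-is,
// so the pattern reaches the regex engine as escape text, and the engine
// (not the F# compiler) turns \uD800 etc. into code units.
let isSurrogatePairVerbatim (input: string) =
    Regex.IsMatch(input, @"[\uD800-\uDBFF][\uDC00-\uDFFF]")

// Hypothetical name. Regex-free variant: walk adjacent UTF-16 code units
// and ask the framework whether any two form a high/low surrogate pair.
let containsSurrogatePair (input: string) =
    input
    |> Seq.pairwise
    |> Seq.exists (fun (high, low) -> Char.IsSurrogatePair(high, low))

// U+20BB7 lies outside the BMP, so ConvertFromUtf32 yields a surrogate pair.
// Building the sample this way avoids depending on source-file encoding.
let sample = Char.ConvertFromUtf32 0x20BB7
printfn "%b" (isSurrogatePairVerbatim sample)
printfn "%b" (containsSurrogatePair sample)
```

Both calls are expected to report true for a string containing U+20BB7, and false for plain BMP text such as "abc". Neither variant places a lone surrogate in a regular F# string literal, which is the step that appears to go wrong above.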
I have just filed this issue on GitHub: https://github.com/Microsoft/visualfsharp/issues/338