Surrogate Pair Detection Fails


I'm working on a small side project that involves porting existing C# code to F#, and I seem to have run into a difference in how the two languages handle regular expressions. (I'm posting this hoping to find out that I'm just doing something wrong.)

This small function detects surrogate pairs using the regular expression trick outlined here. Here's the current implementation:

open System.Text.RegularExpressions

let isSurrogatePair input =
    Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]")

If I then execute it against a known surrogate pair like this:

let result = isSurrogatePair "𠮷野𠮷"
printfn "%b" result

I get false in the FSI window.

If I use the equivalent C#:

public bool IsSurrogatePair(string input)
{
    return Regex.IsMatch(input, "[\uD800-\uDBFF][\uDC00-\uDFFF]");
}

And the same input value, I (correctly) get true back.

Is this a real issue, or am I simply doing something wrong in my F# implementation?


Solution

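  • There appears to be a bug in how the F# compiler encodes escaped Unicode characters in string literals: escapes in the surrogate range come out as 65533 (U+FFFD, the Unicode replacement character) instead of the value that was written.
    Here's the output from F# Interactive (note the last two results):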

    > "\uD500".[0] |> uint16 ;;
    val it : uint16 = 54528us
    > "\uD700".[0] |> uint16 ;;
    val it : uint16 = 55040us
    > "\uD800".[0] |> uint16 ;;
    val it : uint16 = 65533us
    > "\uD900".[0] |> uint16 ;;
    val it : uint16 = 65533us
    

    Fortunately, this workaround works:

    > let s = new System.String( [| char 0xD800 |] )
    s.[0] |> uint16
    ;;
    
    val s : System.String = "�"
    val it : uint16 = 55296us
    

    Based on that finding, I can construct a corrected (or rather, worked-around) version of isSurrogatePair:

    let isSurrogatePair input =
      // Build each surrogate char from its numeric value instead of a "\uXXXX" literal.
      let chrToStr code = new System.String( [| char code |] )
      let pattern = "[" + chrToStr 0xD800 + "-" + chrToStr 0xDBFF + "][" + chrToStr 0xDC00 + "-" + chrToStr 0xDFFF + "]"
      Regex.IsMatch(input, pattern)
    

    This version correctly returns true for your input.

    I have just filed this issue on GitHub: https://github.com/Microsoft/visualfsharp/issues/338
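
As an aside, here is a minimal sketch of two other ways the check could be written so that no surrogate escape ever appears in an ordinary F# string literal. It assumes only standard .NET APIs (Regex.IsMatch and Char.IsSurrogatePair); the names surrogatePairPattern, containsSurrogatePairRegex, and containsSurrogatePair are made up for illustration and are not from the original code. The first variant uses a verbatim string, so the backslashes reach the .NET regex engine untouched and the engine decodes the \uXXXX escapes itself. The second skips regular expressions entirely. I have not tried either on the affected compiler version, so treat this as a sketch rather than a verified fix.

```fsharp
open System
open System.Text.RegularExpressions

// Verbatim string (@"..."): the F# compiler does not process the \u escapes,
// so the regex engine receives them as written and interprets them itself.
let surrogatePairPattern = @"[\uD800-\uDBFF][\uDC00-\uDFFF]"

// Hypothetical name: true if the input contains a high surrogate
// immediately followed by a low surrogate.
let containsSurrogatePairRegex (input: string) =
    Regex.IsMatch(input, surrogatePairPattern)

// Regex-free alternative: walk adjacent UTF-16 code units and let
// Char.IsSurrogatePair decide whether they form a pair.
let containsSurrogatePair (input: string) =
    Seq.pairwise input
    |> Seq.exists (fun (high, low) -> Char.IsSurrogatePair(high, low))

printfn "%b" (containsSurrogatePairRegex "𠮷野𠮷")
printfn "%b" (containsSurrogatePair "𠮷野𠮷")
printfn "%b" (containsSurrogatePair "plain ASCII")
```

The Char.IsSurrogatePair variant has the side benefit of not depending on how any pattern string is encoded, at the cost of no longer being a one-line port of the C# regex.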