Search code examples
c#.netperformanceunicode

How effectively detect surrogate pair in a string?


I need to detect surrogate pairs in hundreds of thousands of strings of various lengths.

I wonder what might be the fastest and most efficient way to do this.

    string srcString = "𝜺𝜺 = 𝟏";
    var hasSurrogate = srcString.Any(c => '\uD800' <= c && c <= '\uDFFF');

Any works fine with its O(n) speed, but maybe there is a more efficient way to do this


Solution

  • The best thing you can do is measure. Have a couple implementations and benchmark them. I've compared using Any() vs. for. A simple for cycle is faster by a factor of 10.

    Method Mean Error StdDev
    HasSurrogateAny 8,095.7 ns 160.38 ns 230.01 ns
    HasSurrogateFor 765.8 ns 14.19 ns 13.27 ns

    I've used BenchmarkDotNet with this simple code:

    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;
    
    namespace PerfTests.Console;
    
    public class Program
    {
        public static void Main(string[] args)
        {
            var summary = BenchmarkRunner.Run<SurrogateBenchmark>();
        }
    }
    
    public class SurrogateBenchmark
    {
        private static string test = "".PadRight(1000) + "𝜺𝜺 = 𝟏";
    
        [Benchmark]
        public bool HasSurrogateAny()
        {
            return test.Any(c => '\uD800' <= c && c <= '\uDFFF');
        }
    
        [Benchmark]
        public bool HasSurrogateFor()
        {
            for (int i = 0; i < test.Length; i++)
            {
                char c = test[i];
                if ('\uD800' <= c && c <= '\uDFFF') return true;
            }
            return false;
        }
    }
    

    Note that I've prepended 1000 spaces before the actual surrogate, so that the algorithms have some work to do and do not encounter a surrogate as the very first character in the string.

    Environment: .NET 6.0.7, Intel Xeon Gold 16-core 2.4 GHz, WS2019 virtual machine

    It turns out that a foreach offers essentially the same performance, as well as a more readable version which uses Char.IsSurrogate():

    Method Mean Error StdDev
    HasSurrogateAny 8,630.1 ns 134.03 ns 125.37 ns
    HasSurrogateForEach 770.1 ns 10.31 ns 9.14 ns
    HasSurrogateForReadable 796.0 ns 13.41 ns 12.54 ns
    HasSurrogateFor 773.2 ns 11.46 ns 10.72 ns

    And, finally, in my setup a version utilizing a for over a Span<char> is actually a bit slower:

    Method Mean Error StdDev
    HasSurrogateFor 764.1 ns 12.99 ns 11.52 ns
    HasSurrogateForSpan 803.7 ns 15.95 ns 21.30 ns

    Code for the remaining three tests:

    [Benchmark]
    public bool HasSurrogateForEach()
    {
        foreach (char c in test)
        {
            if ('\uD800' <= c && c <= '\uDFFF') return true;
        }
        return false;
    }
    
    [Benchmark]
    public bool HasSurrogateForReadable()
    {
        for (int i = 0; i < test.Length; i++)
        {
            char c = test[i];
            if (Char.IsSurrogate(c)) return true;
        }
        return false;
    }
    
    [Benchmark]
    public bool HasSurrogateForSpan()
    {
        var s = test.AsSpan();
        for (int i = 0; i < s.Length; i++)
        {
            char c = test[i];
            if ('\uD800' <= c && c <= '\uDFFF') return true;
        }
        return false;
    }