I need to detect surrogate pairs in hundreds of thousands of strings of various lengths.
I wonder what might be the fastest and most efficient way to do this.
string srcString = "𝜺𝜺 = 𝟏";
var hasSurrogate = srcString.Any(c => '\uD800' <= c && c <= '\uDFFF');
Any
works fine with its O(n) speed, but maybe there is a more efficient way to do this
The best thing you can do is measure. Have a couple implementations and benchmark them. I've compared using Any()
vs. for
. A simple for
cycle is faster by a factor of 10.
Method | Mean | Error | StdDev |
---|---|---|---|
HasSurrogateAny | 8,095.7 ns | 160.38 ns | 230.01 ns |
HasSurrogateFor | 765.8 ns | 14.19 ns | 13.27 ns |
I've used BenchmarkDotNet with this simple code:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
namespace PerfTests.Console;
public class Program
{
public static void Main(string[] args)
{
var summary = BenchmarkRunner.Run<SurrogateBenchmark>();
}
}
public class SurrogateBenchmark
{
private static string test = "".PadRight(1000) + "𝜺𝜺 = 𝟏";
[Benchmark]
public bool HasSurrogateAny()
{
return test.Any(c => '\uD800' <= c && c <= '\uDFFF');
}
[Benchmark]
public bool HasSurrogateFor()
{
for (int i = 0; i < test.Length; i++)
{
char c = test[i];
if ('\uD800' <= c && c <= '\uDFFF') return true;
}
return false;
}
}
Note that I've prepended 1000 spaces before the actual surrogate, so that the algorithms have some work to do and do not encounter a surrogate as the very first character in the string.
Environment: .NET 6.0.7, Intel Xeon Gold 16-core 2.4 GHz, WS2019 virtual machine
It turns out that a foreach
offers essentially the same performance, as well as a more readable version which uses Char.IsSurrogate()
:
Method | Mean | Error | StdDev |
---|---|---|---|
HasSurrogateAny | 8,630.1 ns | 134.03 ns | 125.37 ns |
HasSurrogateForEach | 770.1 ns | 10.31 ns | 9.14 ns |
HasSurrogateForReadable | 796.0 ns | 13.41 ns | 12.54 ns |
HasSurrogateFor | 773.2 ns | 11.46 ns | 10.72 ns |
And, finally, in my setup a version utilizing a for
over a Span<char>
is actually a bit slower:
Method | Mean | Error | StdDev |
---|---|---|---|
HasSurrogateFor | 764.1 ns | 12.99 ns | 11.52 ns |
HasSurrogateForSpan | 803.7 ns | 15.95 ns | 21.30 ns |
Code for the remaining three tests:
[Benchmark]
public bool HasSurrogateForEach()
{
foreach (char c in test)
{
if ('\uD800' <= c && c <= '\uDFFF') return true;
}
return false;
}
[Benchmark]
public bool HasSurrogateForReadable()
{
for (int i = 0; i < test.Length; i++)
{
char c = test[i];
if (Char.IsSurrogate(c)) return true;
}
return false;
}
[Benchmark]
public bool HasSurrogateForSpan()
{
var s = test.AsSpan();
for (int i = 0; i < s.Length; i++)
{
char c = test[i];
if ('\uD800' <= c && c <= '\uDFFF') return true;
}
return false;
}