The following C# method counts string characters considering combining characters (Grapheme Clusters). Here it is:
public static class StringExtensions
{
public static SqlInt32 GetStrLength(this string input)
{
if (string.IsNullOrEmpty(input))
return 0;
return StringInfo.ParseCombiningCharacters(input).Length;
}
}
Now, I create a CLR from it to use inside SQL Server. Here is its code:
public static class UserDefinedFunctions
{
[SqlFunction(IsDeterministic = true, IsPrecise = true)]
public static SqlInt32 GetStrLength(SqlString input)
{
if (input.IsNull)
return 0;
return StringInfo.ParseCombiningCharacters(input.Value).Length;
}
}
The C# version works well, but in SQL Server, it doesn't count properly. What's the problem?
Here are a few examples where the SQLCLR function cannot count correctly:
SQLCLR version (Wrong) | non-SQLCLR version (Correct) |
---|---|
'👩🏻' -> 2 | '👩🏻' -> 1 |
'👨🏻❤️💋👩🏼' -> 9 | '👨🏻❤️💋👩🏼' -> 1 |
Here is the SQL code I have run to get the length:
SELECT dbo.GetStringLength(body) FROM notes;
And the following is the SQL code used to register the SQLCLR:
EXEC sp_configure 'show advanced options' , 1;
RECONFIGURE;
EXEC sp_configure 'clr enable' ,1;
RECONFIGURE;
EXEC sp_configure 'clr strict security', 0;
RECONFIGURE;
CREATE ASSEMBLY StringUtils FROM 'C:\GraphemeClusters.dll' WITH PERMISSION_SET = SAFE;
CREATE FUNCTION dbo.GetStringLength(@input NVARCHAR(MAX))
RETURNS INT
AS EXTERNAL NAME StringUtils.UserDefinedFunctions.GetStrLength;
The collation of the database shouldn't matter as the _SC
and _140_
collations only really affect the behavior of built-in functions, and only in relation to supplementary characters.
I tested your GetStrLength
method as well as StringInfo.LengthInTextElements
, in both SQL Server 2017 and 2022, using the following test string:
DECLARE @Input NVARCHAR(50) =
-- first two count individually as they do not have a base character
NCHAR(0x0303) + NCHAR(0x0303) +
-- next character counts as it is a base character
N'o' +
-- next four do not count as they all attach to the base character
NCHAR(0x0303) + NCHAR(0x0302) + NCHAR(0x0303) + NCHAR(0x0302);
and the expected value of 3 was returned in all cases ("0x0303" and "0x0302" are both combining characters).
The problem is that:
You are testing emoji "sequences", not combining characters,
and:
SQL Server is tied to .NET Framework, and the highest version, 4.8, does not handle emoji "sequences". Newer versions of .NET, however, do.
Hence, while you thought you were testing the non-SQLCLR version against .NET Framework 4.8 with combining characters, you were actually testing against a more recent version of .NET with emoji "sequences".
I executed the following in both LINQPad versions 5 and 8 (the test emoji sequence provided by the OP, minus the skin tone modifiers of U+1F3FB, can be found in the Unicode Emoji documentation as a "Multi-Person Gender" example):
string smooch =
"\U0001F469\U0001F3FB\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468\U0001F3FB";
System.Console.WriteLine(smooch);
System.Console.WriteLine(
System.Globalization.StringInfo.ParseCombiningCharacters(smooch)
.Length);
LINQPad 5, which works with .NET Framework 4.7 and 4.8, returns:
👩🏻❤💋👨🏻
9
The value of 9 comes from the Unicode code points that make up this emoji sequence, minus the emoji presentation selector, which does not count:
LINQPad 8, which works with .NET 5 through 9, returns (I tested against .NET 6.0.35 and 8.0.10):
👩🏻❤️💋👨🏻
1