Search code examples
sql-servert-sqlunicodesql-like

LIKE ignores magic Unicode character


SQL Server 2008 R2

 create table #test (c1 nvarchar(5) not null)

  insert into #test values 
  (N'aaa'),
  (nchar(65533)),
  (N'bbb')

  select * from #test where c1 like N'%�%'
  select * from #test where c1 like N'%'+nchar(65533)+N'%'

Result is

c1
----
aaa
�
bbb

Why? I haven't found in MSDN any notes on this case.


Solution

  • That character (and many others, depending on the Collation version being used) just happens to have no sort weight defined. It is effectively nothing. So whether you have 1 instance of it or 100, it is invisible to anything but binary Collations. Meaning, the following WHERE predicates:

    LIKE N'%' +  NCHAR(0xFFFD) + N'%'
    
    LIKE N'%' +  NCHAR(0xFFFD) +  NCHAR(0xFFFD) + N'%'
    
    LIKE N'%' +  NCHAR(0xFFFD) +  NCHAR(0xFFFD) +  NCHAR(0xFFFD) + N'%'
    

    and so on, are all equivalent to the following:

    LIKE N'%%'
    

    And that is why you get all 3 rows returned.

    This does not mean that this character should have no sort weight. It actually is defined in Unicode as having a weight, but for some reason Microsoft has left quite a few characters without any sort weight at all (though the total number of characters missing a sort weight is decreasing with each new Collation version, the newest being the version 140 Collations, which came with SQL Server 2017 and are only for the Japanese Collations).

    For any character that has no sort weight, the only way to match it is by using a binary Collation. Binary Collations are those ending in either _BIN or _BIN2, but only use the _BIN2 Collations since they sort properly whereas the older _BIN Collations do not. For example:

    SELECT * FROM #test WHERE c1 LIKE N'%�%' COLLATE Latin1_General_100_BIN2;
    

    Returns:

    c1
    ----

    Also, I tested with the following and they returned all 3 rows:

    • Latin1_General_CS_AS_KS_WS
    • Latin1_General_100_CS_AS_KS_WS_SC

    So, the following should be good:

    • Latin1_General_BIN2
    • Latin1_General_100_BIN2

    Also, it is best to use the highest Collation version available for the Collation you are trying to use. For example, use Latin1_General_100_* instead of Latin1_General_*, and so on. Use the following query to find the Collations available on your Instance:

    SELECT col.*
    FROM   sys.fn_helpcollations() col
    ORDER BY col.[name];