Search code examples
sql-serverunicodesql-server-2008-r2normalizationunicode-normalization

Unicode normalization in SQL Server 2008 R2


@I have a row in a table which contains the following text "Urbański, Mariusz". The hex representation for character "ń" is "6e cc 81". So this is stored in a decomposition Unicode normalization form.

When I use a query like the following "...... where Identification = N'Urbański, Mariusz'" and character "ń" matches the decomposition form ("6e cc 81") query returns the expected records.

If I run the exact same query using a Composition Unicode Normalization Form ("ń" = "c5 84") I get no results.

I also tried "Select 1 Where N'Urbański, Mariusz' = N'Urbański, Mariusz'" where I use the 2 variations of "ń" which always returns true.

Is there a way to make SQL Server treat the 2 values as equal?

Here are my database configuration are requested by Rhys Jones

Database Collation : "Danish_Norwegian_CI_AS"
Column1 : IdRightsHolderSourceIdentification = NULL  
Column2 : VersionInfo = NULL  
Column3 : Source = "Danish_Norwegian_CI_AS"
Column4 : Identification = "SQL_Latin1_General_CP437_BIN"
Column5 : RightsHolder = NULL

The problematic column is, as Rhys Jones very well guessed, column4 and it has a binary collation (that's what BIN in the end means right?). Thanks a lot fo the assistance.


Solution

  • I can see from your question that you understand Unicode so I guess the bit you're missing is that in SQL Server there is something called collation. This is what determines how SQL Server compares values. I've put together a script to demonstrate various successful and unsuccessful comparisons between the two forms of the name. The SQL Server setup I'm currently using uses Latin1_General_CI_AS under which both forms of the name ARE equal. I've tried a few things but I can't make them not equal unless I use a binary collation. I'd be interested to know your server and database collations and to see the table definition (including collations) for your table.

    Hope this helps,

    Rhys

    -- UTF8 character list -- http://www.fileformat.info/info/charset/UTF-8/list.htm?
    --    n - LATIN SMALL LETTER N (U+006E) 6e 
    --    ́    - COMBINING ACUTE ACCENT (U+0301) cc81 
    --    ń - LATIN SMALL LETTER N WITH ACUTE (U+0144) c584 
    
    create table dbo.MyTable (id int not null, name nvarchar(100) collate Latin1_General_CI_AS not null)
    
    declare @a nvarchar(100); set @a = N'Urba' + nchar(0x0144) + N'ski, Mariusz'
    declare @b nvarchar(100); set @b = N'Urba' + nchar(0x6e) + nchar(0x0301) + N'ski, Mariusz'
    
    insert dbo.MyTable values (1, @a)
    insert dbo.MyTable values (2, @b)
    
    -- Display server, database and column collations
    select
        SERVERPROPERTY(N'Collation') as [server_collation],
        DATABASEPROPERTYEX(DB_NAME(), N'Collation') as [database_default_collation],
        c.collation_name as [column_collation]
    from
        sys.objects t join sys.columns c on c.object_id = t.object_id
    where 
        t.object_id = object_id('dbo.MyTable') and c.name = 'name'
    
    -- Test with Latin1_General_CI_AS
    select id, name from dbo.MyTable where name collate Latin1_General_CI_AS = @a collate Latin1_General_CI_AS
    
    -- Test with French_CI_AS
    select id, name from dbo.MyTable where name collate French_CI_AS = @a collate French_CI_AS
    
    -- Test with Latin1_General_BIN2
    select id, name from dbo.MyTable where name collate Latin1_General_BIN2 = @a collate Latin1_General_BIN2