Search code examples
sql-servert-sqlcollationsqldatatypes

Collation and datatype incompatibility on strings


I am very confused on the behavior of the system on the case where both collation and datatype differences are involved.

As a minimal example, I am inputting the same Unicode value to the single column of two different tables. In one table the column is varchar and of a certain collation, and on the other it's nvarchar and of another collation. Code and results:

create table cn(code nvarchar(max) collate Latin1_General_CI_AS)
create table cv(code varchar(max) collate SQL_Latin1_General_CP1253_CI_AI)

insert cn select N'3VT18021δ'
insert cv select N'3VT18021δ'

select * from cn
select * from cv

--1. 
select * from cn inner join cv on cn.code=cv.code 
-- Cannot resolve the collation conflict between "SQL_Latin1_General_CP1253_CI_AI" and "Latin1_General_CI_AS" in the equal to operation.

--2. 
select * from cn inner join cv on cn.code=cv.code collate SQL_Latin1_General_CP1253_CI_AI   
-- returns one row

--3. 
select * from cn inner join cv on cn.code =cv.code collate Latin1_General_CI_AS 
-- returns 0 rows

--4. 
select * from cn inner join cv on cn.code collate SQL_Latin1_General_CP1253_CI_AI =cv.code   
-- returns one row

--5. 
select * from cn inner join cv on cn.code collate Latin1_General_CI_AS =cv.code 
-- returns one row

My notes:

Case 1: collation difference, I understand

Cases 2 and 5: return (correctly) one row. Why does collating a field to its own collation do any good?

Cases 3 and 4: Why converting one's collation to the other works one time, but not the other?

Of course, all these get further complication from the datatype difference.


Solution

  • Cases 2 and 5: return (correctly) one row. Why does collating a field to its own collation do any good?

    When you explicitly use COLLATE on a value in a clause both sides of the expression are explicitly converted to that collation, thus there is no conflict.

    Cases 3 and 4: Why converting one's collation to the other works one time, but not the other?

    One of your columns is a varchar, so when it's changed from one collation to the other, its value changes. This is, specifically, when you COLLATE the value in your table cv to the collation Latin1_General_CI_AS. As 'δ' isn't a character available in the collation for a varchar, it changes to a 'd' and '3VT18021d' does not equal N'3VT18021δ'. You can see this with the below:

    SELECT code COLLATE Latin1_General_CI_AS
    FROM cv;
    

    You would need to explicitly convert the value to a nvarchar first:

    select *
    from cn
         inner join cv on cn.code = CONVERT(nvarchar(MAX),cv.code) collate Latin1_General_CI_AS;
    --Returns one row now
    

    Edit: To explain why Query 3 does not return data, and Query 5 does, this is because of the positioning of the COLLATEs and when the implicit conversion happens.

    cn.code =cv.code collate Latin1_General_CI_AS --3
    cn.code collate Latin1_General_CI_AS =cv.code --5
    

    For Query 3, the COLLATE expression is on cv.code, which is the varchar. As a result the value has it's collation changed first and the character 'δ' is lost. Then it is implicitly converted to an nvarchar, due to data type precedence.

    For Query 5, however, the COLLATE is on cn.code the nvarchar. As a result when the value's collation is changed no characters are lost. As cv.code doesn't have an explicit COLLATE, it is instead first converted to an nvarchar (due to data type precendence) and then collated; causing no loss of characters.