Search code examples
sql-servert-sqldatabase-performance

COLLATE clause cannot be used on expressions containing a COLLATE clause


I have two tables: tab1 and tab2. Every table has only one VARCHAR(MAX) column.

I need to obtain only those pairs of values, which are equals, but differs in only case.

Example input:

tab1.t1     tab2.t2
-----------------------
fff     fff
FFF     fff
Fff     fff
FFF     FFA
FfA     FFF
FFF     aaa
bbb     aaa

Related output:

t1      t2
-----------------------
fff     FFF
FFF     fff
Fff     fff
Fff     FFF
FfA     FFA

Tables tab1 and tab2 are big enough in real our real database (~ 800-1000 rows). And there are about 500-600 columns for which I need to produce this operation.

So I need to write a fast solution. I write an alghoritm:

  1. remove all duplicates using case sensitive collation from tab1
  2. remove all duplicates using case sensitive collation from tab2
  3. join result sets from previous steps using case insensitive collation
  4. filter out (WHERE clause) rows which values are not equals using case sensitive collation
  5. remove duplicate rows using case sensitive collation

I tried so:

SELECT DISTINCT tt.t1 COLLATE Cyrillic_General_CS_AS, tt.t2 COLLATE Cyrillic_General_CS_AS
FROM (
    SELECT tt1.t1, tt2.t2
    FROM 
    (
        SELECT tab1.t1 COLLATE Cyrillic_General_CS_AS
                       AS t1
        FROM (VALUES('fff'),('FFF'),('Fff'),('FFF'),('FfA'),('FFF'),('bbb')) AS tab1(t1)
        GROUP BY tab1.t1 COLLATE Cyrillic_General_CS_AS
    ) tt1 INNER JOIN
    (
        SELECT tab2.t2 COLLATE Cyrillic_General_CS_AS
                       AS t2
        FROM (VALUES('fff'),('fff'),('fff'),('FFA'),('FFF'),('aaa'),('aaa')) AS tab2(t2)
        GROUP BY tab2.t2 COLLATE Cyrillic_General_CS_AS
    ) tt2
    ON tt1.t1 = tt2.t2 COLLATE Cyrillic_General_CI_AS
) AS tt
WHERE tt.t1 <> tt.t2 COLLATE Cyrillic_General_CS_AS

but it an error occur:

"COLLATE clause cannot be used on expressions containing a COLLATE clause."

Please suggest me how to avoid this problem without using user defined functions, temp tables, or removing group by clause (i tried them - they hardly slow down execution).


Solution

  • Because SQL Server can process JOINs in any order and either

    1. carry forward expressions to be evaluated in the final output
    2. resolve expressions very early, as early as streamlining into the initial retrieval from the data

    It cannot guarantee the stage at which it will force the application of the COLLATE. Because of that, you can only apply it ONCE to any column/expression. Since you already have DISTINCT in the outer query, the GROUP BYs are superfluous in the derived tables - even if you feel that it will cut down on the intermediate result sets.


    SELECT DISTINCT tt.t1 COLLATE Cyrillic_General_CS_AS, tt.t2 COLLATE Cyrillic_General_CS_AS
    FROM (
        SELECT tt1.t1, tt2.t2
        FROM 
            (VALUES('fff'),('FFF'),('Fff'),('FFF'),('FfA'),('FFF'),('bbb')) tt1(t1)
        INNER JOIN
            (VALUES('fff'),('fff'),('fff'),('FFA'),('FFF'),('aaa'),('aaa')) tt2(t2)
        ON tt1.t1 = tt2.t2
    ) AS tt
    WHERE tt.t1 <> tt.t2 COLLATE Cyrillic_General_CS_AS