Search code examples
sql-serverreplaceunicodenvarcharsql-server-collation

SQL Server Unicode character replacement


I'm trying to remove/replace certain unicode symbol characters from text in SQL Server 2019.
The server/DB I'm working with has collation Latin1_General_CI_AS.
I can't change the collation of the server/DB so i tried this (and most of the time it works, but some symbols don't work).

Environment:
SQL Server version:
Microsoft SQL Server 2019 (RTM-GDR) (KB4583458) - 15.0.2080.9 (X64) Nov 6 2020 16:50:01 Copyright (C) 2019 Microsoft Corporation Developer Edition (64-bit) on Windows 10 Pro 10.0 (Build 19041: )

Server Collation: Latin1_General_CI_AS
Database Collation: Latin1_General_CI_AS

This example works as expected:

select replace(N'☋a' collate Latin1_General_100_CI_AI_SC, N'☋' collate Latin1_General_100_CI_AI_SC, N'XXX')
select replace(N'a☋' collate Latin1_General_100_CI_AI_SC, N'☋' collate Latin1_General_100_CI_AI_SC, N'XXX')

Output:
XXXa
aXXX

This example only works if the unicode symbol (⚶) is the first character, but not if another character stands before it:

select replace(N'⚶a' collate Latin1_General_100_CI_AI_SC, N'⚶' collate Latin1_General_100_CI_AI_SC, N'XXX')
select replace(N'a⚶' collate Latin1_General_100_CI_AI_SC, N'⚶' collate Latin1_General_100_CI_AI_SC, N'XXX')

Output:
XXXa (correct)
a⚶ (wrong)

Does anybody have an idea why the replacement works with some characters/character orders, and with some it doesn't?


Solution

  • I would suggest using a binary collation when dealing with such characters. The non-binary collations have a habit of not really "knowing what to do with" such characters like those in your question. Binary ones on the other hand work fine with them, as every character is treated as a distinctly different one:

    SELECT REPLACE(N'⚶a' COLLATE Latin1_General_100_BIN2, N'⚶' COLLATE Latin1_General_100_BIN2, N'XXX');
    SELECT REPLACE(N'a⚶' COLLATE Latin1_General_100_BIN2, N'⚶' COLLATE Latin1_General_100_BIN2, N'XXX');
    

    Note, however, that if you would be expecting N'⚶A' to be replaced, it won't be, as 'A' and 'a' are not equivalent in a binary collation.