Search code examples
sqlsql-servercollationcharindex

CHARINDEX returns the wrong result in some COLLATIONs when searching for thorn (character 254)


Overview

CHARINDEX is returning the wrong value occasionally when using a collation sequence like:

Latin1_General_CI_AS 

but works with a collation sequence like:

SQL_Latin1_General_CP1_CI_AS

This has been encountered on MS SQL Server 2008 R2 and SQL Server 2016.

Examples

Assume the database collation sequence is:

Latin1_General_CI_AS
  • print CHARINDEX( CHAR(254), 'Tþ' ) -- returns 2 which is Correct
  • print CHARINDEX( CHAR(254), 'Th' ) -- returns 1 which is WRONG
  • print CHARINDEX( CHAR(253), 'Th' ) -- returns 0 which is Correct
  • print CHARINDEX( CHAR(254) Collate SQL_Latin1_General_CP1_CI_AS, 'Thþ' Collate SQL_Latin1_General_CP1_CI_AS) -- returns 3 which is Correct
  • print CHARINDEX( CHAR(254) Collate Latin1_General_CI_AS, 'Thþ' Collate Latin1_General_CI_AS) -- returns 1 which is WRONG

Is there a known error with the Latin1... collation sequences?


Solution

  • This isn't anything specific to SQL Server.

    In C#

    string.Compare("þ", "th", false, new System.Globalization.CultureInfo(1033))
    

    returns 0 indicating the strings compare equal.

    Or in notepad clicking "Replace all" below

    enter image description here

    leads to

    enter image description here

    In SQL Server collations that do not start with "SQL" use windows collation rules.

    For those in most locales (Iceland being an exception) the thorn character þ expands to th.

    There is more information about this specific case in this post by Michael S. Kaplan Every rose has it's Þ..... That blog has a wealth of information about windows collations. Expansions are described in more detail here.

    If you don't want these semantics you need to use a collation (likely SQL or binary) with the semantics that you do want (perhaps via an explicit collate clause).