sql-server unicode thai

Unable to query for Thai data


I have a table with NVARCHAR(255) columns that contain both Thai and English text. In SSMS I can query the table and return all the rows easily enough, but if I then query specifically for one of the Thai values, it returns no rows.

SELECT TOP 1000
    [Province]
    , [District]
    , [SubDistrict]
    , [Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]

Returns

Province    District    SubDistrict Branch 
อุตรดิตถ์   ลับแล   ศรีพนมมาศ   Northern
Bangkok  Khlong Toei    Khlong Tan  SSS1

But this query:

SELECT
    [Province]
    , [District]
    , [SubDistrict]
    , [Branch ]
FROM [THDocuworldRego].[dbo].[allDistricsBranches]
WHERE [Province] LIKE 'อุตรดิตถ์'

Returns no rows. What do I need to do to get the expected results? The collation is set to Latin1_General_CI_AS. The data is inserted and displayed with no errors; I just can't search it.


Solution

  • Two problems:

    1. The string being passed into the LIKE clause is VARCHAR due to not being prefixed with a capital "N". For example:

      SELECT 'อุตรดิตถ์' AS [VARCHAR], N'อุตรดิตถ์' AS [NVARCHAR]
      -- ?????????        อุตรดิตถ์
      

      What is happening here is that when SQL Server parses the query batch, it needs to determine the exact type and value of all literals / constants. So it figures out that 12 is an INT and 12.0 is a NUMERIC, etc. It knows that N'ดิ' is NVARCHAR, which is encoded as Unicode, an all-inclusive character set, so it takes the value as is. But, as noted before, 'ดิ' is VARCHAR, which is an 8-bit encoding, which means that the character set is controlled by a Code Page. For string literals and variables / parameters, the Code Page used for VARCHAR data is determined by the Database's default Collation. If the string contains characters that are not available on that Code Page, they are either converted to a "best fit" mapping, if such a mapping exists, or else they become the default replacement character: ?.

      Technically speaking, since the Database's default Collation controls string literals (and variables), and since there is a Code Page for "Thai" (available in Windows Collations), then it would be possible to have a VARCHAR string containing Thai characters (meaning: 'ดิ', without the "N" prefix, would work). But that would require changing the Database's default Collation, and that is A LOT more work than simply prefixing the string literal with "N".

      For an in-depth look at this behavior, please see my two-part series:

    2. You need to add the wildcard characters to both ends:
      N'%อุตรดิตถ์%'

    The end result will look like:

    WHERE [Province] LIKE N'%อุตรดิตถ์%'
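
    Both points can be tried out without touching the real table. The following is only a minimal sketch: @Branches is a hypothetical stand-in for the question's table, and it assumes the Database's default Collation uses Code Page 1252 (as Latin1_General_CI_AS does). Under that assumption the first count should be 0 and the second should be 1.

    ```sql
    -- Hypothetical stand-in for the question's table; only [Province] is needed here
    DECLARE @Branches TABLE ([Province] NVARCHAR(255));
    INSERT INTO @Branches ([Province]) VALUES (N'อุตรดิตถ์'), (N'Bangkok');

    -- No "N" prefix: the literal is VARCHAR, so on Code Page 1252 it has
    -- already been turned into question marks before the comparison runs
    SELECT COUNT(*) AS [WithoutN]
    FROM @Branches
    WHERE [Province] LIKE 'อุตรดิตถ์';

    -- "N" prefix: the literal stays Unicode and can match the stored value
    SELECT COUNT(*) AS [WithN]
    FROM @Branches
    WHERE [Province] LIKE N'%อุตรดิตถ์%';
    ```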
    

    EDIT:
    I just edited the question to format the "results" to be more readable. It now appears that the following might also work (since no wildcards are being used in the LIKE predicate in the question):

    WHERE [Province] = N'อุตรดิตถ์'
    

    EDIT 2:
    A string (i.e. something inside of single-quotes) is VARCHAR if there is no "N" prefixed to the string literal. It doesn't matter what the destination datatype is (e.g. an NVARCHAR(255) column). The issue here is the datatype of the source data, and that source is a string literal. Unlike a string in .NET, SQL Server handles 'string' as an 8-bit encoding (VARCHAR: ASCII values 0 - 127 are the same across all Code Pages, Extended ASCII values 128 - 255 are determined by the Code Page, and Double-Byte Character Sets can also use 2-byte sequences) and N'string' as UTF-16 Little Endian (NVARCHAR: Unicode character set, one 2-byte sequence for BMP characters 0 - 65535, two 2-byte sequences for Code Points above 65535). Using 'string' is the same as passing in a VARCHAR variable. For example:

    DECLARE @ASCII VARCHAR(20);
    SET @ASCII = N'อุตรดิตถ์';
    SELECT @ASCII AS [ImplicitlyConverted]
    -- ?????????
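
    The datatype of a literal, and the Code Page behind a Collation, can also be inspected directly. This is a rough sketch using built-in metadata functions; Thai_CI_AS is only an example of a Thai Windows Collation and does not come from the question.

    ```sql
    -- The base type of a literal depends only on the "N" prefix
    SELECT SQL_VARIANT_PROPERTY('abc',  'BaseType') AS [NoPrefix],   -- varchar
           SQL_VARIANT_PROPERTY(N'abc', 'BaseType') AS [WithPrefix]; -- nvarchar

    -- Same three characters: 1 byte each as VARCHAR, 2 bytes each as NVARCHAR
    SELECT DATALENGTH('abc')  AS [VarcharBytes],  -- 3
           DATALENGTH(N'abc') AS [NvarcharBytes]; -- 6

    -- Default Collation of the current database (governs VARCHAR literals and variables)
    SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS [DatabaseCollation];

    -- Code Page behind the question's Collation versus a Thai Collation
    SELECT COLLATIONPROPERTY('Latin1_General_CI_AS', 'CodePage') AS [Latin1CodePage], -- 1252
           COLLATIONPROPERTY('Thai_CI_AS',           'CodePage') AS [ThaiCodePage];   -- 874
    ```

    Code Page 1252 has no Thai characters, which is why the unprefixed literal ends up as question marks; Code Page 874 is the Thai one.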