Search code examples
sql-serverunicodeencodingcollationucs2

NVARCHAR storing characters not supported by UCS-2 encoding on SQL Server


By SQL Server's documentation (and legacy documentation), a nvarchar field without _SC collation, should use the UCS-2 ENCODING.

Starting with SQL Server 2012 (11.x), when a Supplementary Character (SC) enabled collation is used, these data types store the full range of Unicode character data and use the UTF-16 character encoding. If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.

It also states that the UCS-2 ENCODING stores only the subset characters supported by UCS-2. From wikipedia UCS-2 specification:

UCS-2, uses a single code value [...] between 0 and 65,535 for each character, and allows exactly two bytes (one 16-bit word) to represent that value. UCS-2 thereby permits a binary representation of every code point in the BMP that represents a character. UCS-2 cannot represent code points outside the BMP.

So, by the specifications above, seems that I won't be able to store a emoji like: 😍 which have a value of 0x1F60D (or 128525 in decimal, way above 65535 limit of UCS-2). But on SQL Server 2008 R2 or SQL Server 2019 (both with the default SQL_Latin1_General_CP1_CI_AS COLLATION), on a nvarchar field, it's perfectly stored and returned (although not supported on comparisons with LIKE or =):

enter image description here

SMSS doesn't render emoji correctly, but here is the value copied and pasted from query result: 😍

So my questions are:

  1. Is nvarchar field really using USC-2 on SQL Server 2008 R2 (I also tested on SQL Server 2019, with same non _SC collations and got same results)?

  2. Is Microsoft's documentation of nchar/nvarchar misleading about "then these data types store only the subset of character data supported by the UCS-2 character encoding"?

  3. Does UCS-2 ENCODING support or not code points beyond 65535?

  4. How SQL Server was able to correctly store and retrieve this field's data, when it's outside the support of UCS-2 ENCODING?

NOTE: Server's Collation is SQL_Latin1_General_CP1_CI_AS and Field's Collation is Latin1_General_CS_AS.
NOTE 2: The original question stated tests about SQL Server 2008. I tested and got same results on a SQL Server 2019, with same respective COLLATIONs.
NOTE 3: Every other character I tested, outside UCS-2 supported range, is behaving on the same way. Some are: 𝕂, 😂, 𨭎, 𝕬, 𝓰


Solution

  • There are several clarifications to make here regarding the MS documentation snippets posted in the question, and for the sample code, for the questions themselves, and for statements made in the comments on the question. Most of the confusion can be cleared up, I believe, by the information provided in the following post of mine:

    How Many Bytes Per Character in SQL Server: a Completely Complete Guide

    First things first (which is the only way it can be, right?): I'm not insulting the people who wrote the MS documentation as SQL Server alone is a huge product and there is a lot to cover, etc, but for the moment (until I get a chance to update it), please read the "official" documentation with a sense of caution. There are several misstatements regarding Collations / Unicode.

    1. UCS-2 is an encoding that handles a subset of the Unicode character set. It works in 2-byte units. With 2 bytes, you can encode values 0 - 65535. This range of code points is known as the BMP (Basic Multilingual Plane). The BMP is all of the characters that are not Supplementary Characters (because those are supplementary to the BMP), but it does contain a set of code points that are exclusively used to encode Supplementary Characters in UTF-16 (i.e. the 2048 surrogate code points). This is a complete subset of UTF-16.

    2. UTF-16 is an encoding that handles all of the Unicode character set. It also works in 2-byte units. In fact, there is no difference between UCS-2 and UTF-16 regarding the BMP code points and characters. The difference is that UTF-16 makes use of those 2048 surrogate code points in the BMP to create surrogate pairs which are the encodings for all Supplementary Characters. While Supplementary Characters are 4-bytes (in UTF-8, UTF-16, and UTF-32), they are really two 2-byte code units when encoding in UTF-16 (likewise, they are four 1-byte units in UTF-8, and one 4-byte in UTF-32).

    3. Since UTF-16 merely extends what can be done with UCS-2 (by actually defining the usage of the surrogate code points), there is absolutely no difference in the byte sequences that can be stored in either case. All 2048 surrogate code points used to create Supplementary Characters in UTF-16 are valid code points in UCS-2, they just don't have any defined usage (i.e. interpretation) in UCS-2.

    4. NVARCHAR, NCHAR, and the deprecated-so-do-NOT-use-it-NTEXT datatypes all store Unicode characters encoded in UCS-2 / UTF-16. From a storage perspective there is absolutely NO difference. So, it doesn't matter if something (even outside of SQL Server) says that it can store UCS-2. If it can do that, then it can inherently store UTF-16. In fact, while I have not had a chance to update the post linked above, I have been able to store and retrieve, as expected, emojis (most of which are Supplementary Characters) in SQL Server 2000 running on Windows XP. There were no Supplementary Characters defined until 2003, I think, and certainly not in 1999 when SQL Server 2000 was being developed. In fact (again), UCS-2 was only used in Windows / SQL Server because Microsoft pushed ahead with development prior to UTF-16 being finalized and published (and as soon as it was, UCS-2 became obsolete).

    5. The only difference between UCS-2 and UTF-16 is that UTF-16 knows how to interpret surrogate pairs (comprised of a pair of surrogate code points, so at least they're appropriately named). This is where the _SC collations (and, starting in SQL Server 2017, also version _140_ collations which include support for Supplementary Characters so none of them have the _SC in their name) come in: they allow the built-in SQL Server functions to correctly interpret Supplementary Characters. That's it! Those collations have nothing to do with storing and retrieving Supplementary Characters, nor do they even have anything to do with sorting or comparing them (even though the "Collation and Unicode Support" documentation says specifically that this is what those collations do — another item on my "to do" list to fix). For collations that have neither _SC nor _140_ in their name (though the new-as-of-SQL Server 2019 Latin1_General_100_BIN2_UTF8 might be grey-area, at least, I remember there being some inconsistency either there or with the Japanese_*_140_BIN2 collations), the built-in functions only handle BMP code points (i.e. UCS-2).

    6. Not "handling" Supplementary Characters means not interpreting a valid sequence of two surrogate code points as actually being a singular supplementary code point. So, for non-"SC" collations, BMP surrogate code point 1 (B1) and BMP surrogate code point 2 (B2) are just those two code points, neither one of which is defined, hence they appear as two "nothing"s (i.e. B1 followed by B2). This is why it is possible to split a Supplementary Character in two using SUBSTRING / LEFT / RIGHT because they won't know to keep those two BMP code points together. But an "SC" collation will read those code points B1 and B2 from disk or memory and see a single Supplementary code point S. Now it can be handled correctly via SUBSTRING / CHARINDEX / etc.

    7. The NCHAR() function (not the datatype; yes, poorly named function ;) is also sensitive to whether or not the default collation of the current database supports Supplementary Characters. If yes, then passing in a value between 65536 and 1114111 (the Supplementary Character range) will return a non-NULL value. If not, then passing in any value above 65535 will return NULL. (Of course, it would be far better if NCHAR() just always worked, given that storing / retrieving always works, so please vote for this suggestion: NCHAR() function should always return Supplementary Character for values 0x10000 - 0x10FFFF regardless of active database's default collation ).

    8. Fortunately, you don't need an "SC" collation to output a Supplementary Character. You can either paste in the literal character, or convert the UTF-16 Little Endian encoded surrogate pair, or use the NCHAR() function to output the surrogate pair. The following works in SQL Server 2000 (using SSMS 2005) running on Windows XP:

      SELECT N'💩', -- 💩
      CONVERT(VARBINARY(4), N'💩'), -- 0x3DD8A9DC
      CONVERT(NVARCHAR(10), 0x3DD8A9DC), -- 💩 (regardless of DB Collation)
      NCHAR(0xD83D) + NCHAR(0xDCA9) -- 💩 (regardless of DB Collation)
      

      For more details on creating Supplementary Characters when using non-"SC" collations, please see my answer to the following DBA.SE question: How do I set a SQL Server Unicode / NVARCHAR string to an emoji or Supplementary Character?

    9. None of this affects what you see. If you store a code point, then it's there. How it behaves — sorting, comparison, etc — is controlled by collations. But, how it appears is controlled by fonts and the OS. No font can contain all characters, so different fonts contain different sets of characters, with a lot of overlap on the more widely used characters. However, if a font has a particular byte sequence mapped, then it can display that character. This is why the only work required to get Supplementary Characters displaying correctly in SQL Server 2000 (using SSMS 2005) running on Windows XP was to add a font containing the characters and doing one or two minor registry edits (no changes to SQL Server).

    10. Supplementary Characters in SQL_* collations and collations without a version number in their name have no sort weights. Hence, they all equate to each other as well as to any other BMP code points that have no sort weights (including "space" (U+0020) and "null" (U+0000)). They started to fix this in the version _90_ collations.

    11. SSMS has nothing to do with any of this, outside of possibly needing the font used for the query editor and/or grid results and/or errors + messages changed to one that has the desired characters. (SSMS doesn't render anything outside of maybe spatial data; characters are rendered by the display driver + font definitions + maybe something else).

    Therefore, the following statement in the documentation (from the question):

    If a non-SC collation is specified, then these data types store only the subset of character data supported by the UCS-2 character encoding.

    is both nonsensical and incorrect. They were probably intending to say the datatypes would only store a subset of the UTF-16 encoding (since UCS-2 is the subset). Also, even if it said "UTF-16 character encoding" it would still be wrong because the bytes that you pass in will be stored (assuming enough free space in the column or variable).