Tags: sql-server, character-encoding, utf-16, ucs2, codepoint

What are the consequences of storing a C# string (UTF-16) in a SQL Server nvarchar (UCS-2) column?


It seems that SQL Server uses UCS-2, a fixed-length 2-byte character encoding, for nchar/nvarchar fields, while C# uses UTF-16 for its strings. (Note: some people don't consider UCS-2 to be Unicode at all, but it encodes the same code points as UTF-16 in the range 0-0xFFFF, and as far as SQL Server is concerned it's the closest thing to "Unicode" that it natively supports for character strings.)

While UCS-2 encodes the same code points as UTF-16 within the Basic Multilingual Plane (BMP), it doesn't reserve the bit patterns (code unit ranges) that UTF-16 sets aside for surrogate pairs.

If I write a C# string to an SQL Server nvarchar (UCS-2) field and read it back, will this always return the same result?

It seems that while UTF-16 is a superset of UCS-2 in the sense that it encodes more code points (e.g. those above 0xFFFF), at the level of individual 2-byte code units it's actually a subset of UCS-2, since it's more restrictive about which values can stand alone.

To answer my own question, I suspect that if my C# string contains code points above 0xFFFF (represented as surrogate pairs), these would be stored and retrieved just fine in the database, but if I tried to manipulate them in the database (e.g. by calling UPPER or attempting to blank out every other character), then I could run into problems displaying the string later... unless SQL Server has functions that acknowledge surrogate pairs and effectively treat nchar/nvarchar strings as UTF-16.
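
To make the surrogate-pair concern concrete, here is a minimal C# sketch (U+1F600 is just an arbitrary code point above 0xFFFF chosen for illustration) showing that such a character takes two UTF-16 code units in a string, and that naive per-code-unit manipulation, like blanking out every other "character", leaves an unpaired surrogate behind:

    using System;
    using System.Globalization;

    class SurrogatePairDemo
    {
        static void Main()
        {
            // U+1F600 lies above 0xFFFF, so UTF-16 represents it as a surrogate pair.
            string s = char.ConvertFromUtf32(0x1F600);

            Console.WriteLine(s.Length);                               // 2 -- two UTF-16 code units
            Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 -- one visible character

            // Blanking out "every other character" by code unit splits the pair
            // and leaves an unpaired (invalid) surrogate behind.
            char[] units = s.ToCharArray();
            units[1] = ' ';
            string broken = new string(units);
            Console.WriteLine(char.IsHighSurrogate(broken[0]));        // True -- now unpaired
        }
    }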


Solution

  • It's all a bit of a fudge really.

    First the similarities

    • The SQL Server nchar/nvarchar/ntext data types store text as a string of 2-byte characters. They don't really care what you put in them until you come to do searching and sorting (then they use the appropriate Unicode collation sequence).
    • The CLR String data type also stores text as a string of 2-byte Chars. It also doesn't really care what you put in it until you come to do searching and sorting (then it uses the appropriate culture-specific methods).

    Now the differences

    • .NET allows you to access the actual Unicode code points in a CLR string via the StringInfo class.
    • .NET has extensive support for encoding and decoding text data in a variety of encodings. When you convert an arbitrary byte stream to a String, the result is always held as UTF-16 internally, with full support for code points beyond the BMP via surrogate pairs (see the sketch after this list).
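
    As a rough illustration of both points (the code points used here are just examples picked for the demo), StringInfo walks a string by text element rather than by 2-byte Char, and decoding a byte stream always produces a UTF-16 String, surrogate pairs included:

        using System;
        using System.Globalization;
        using System.Text;

        class CodePointDemo
        {
            static void Main()
            {
                // "A" followed by U+10400, a code point outside the BMP.
                string s = "A" + char.ConvertFromUtf32(0x10400);

                // StringInfo enumerates text elements (whole code points), not 2-byte code units.
                TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
                while (e.MoveNext())
                {
                    Console.WriteLine("U+{0:X4}", char.ConvertToUtf32((string)e.Current, 0));
                }

                // Decoding an arbitrary byte stream always yields a UTF-16 String,
                // surrogate pairs and all.
                byte[] utf8 = Encoding.UTF8.GetBytes(s);
                string decoded = Encoding.UTF8.GetString(utf8);
                Console.WriteLine(decoded == s);   // True
            }
        }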

    In short, as long as you treat both CLR and SQL Server string variables as whole blobs of text, then you can freely assign from one to the other with no loss of information. The underlying storage format is exactly the same, even though the abstractions layered on top are slightly different.
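
    For what it's worth, the round trip through an actual nvarchar column is easy to verify. This is just a sketch using the classic System.Data.SqlClient API with a throwaway table variable and a placeholder connection string you would need to replace:

        using System;
        using System.Data;
        using System.Data.SqlClient;

        class NvarcharRoundTrip
        {
            static void Main()
            {
                // Placeholder connection string -- point this at your own server.
                const string connectionString = "Server=.;Database=tempdb;Integrated Security=true";

                // A string containing a code point above 0xFFFF (one surrogate pair).
                string original = "naïve " + char.ConvertFromUtf32(0x1F600);

                // Store the value in an nvarchar(100) column and read it straight back.
                const string sql = @"
                    DECLARE @t TABLE (Body nvarchar(100));
                    INSERT INTO @t (Body) VALUES (@body);
                    SELECT Body FROM @t;";

                using (var conn = new SqlConnection(connectionString))
                using (var cmd = new SqlCommand(sql, conn))
                {
                    cmd.Parameters.Add("@body", SqlDbType.NVarChar, 100).Value = original;
                    conn.Open();

                    string readBack = (string)cmd.ExecuteScalar();
                    Console.WriteLine(readBack == original);   // True -- the same code units come back
                }
            }
        }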