It seems that SQL Server uses Unicode UCS-2, a 2-byte fixed-length character encoding, for nchar/nvarchar
fields. Meanwhile, C# uses Unicode UTF-16 encoding for its strings (note: Some people don't consider UCS-2 to be Unicode, but it encodes all the same code points as UTF-16 in the Unicode subset 0-0xFFFF, and as far as SQL Server is concerned, that's the closest thing to "Unicode" it natively supports in terms of character strings.)
While UCS-2 encodes the same basic code points as UTF-16 in the Basic Multilingual Plane (BMP), it doesn't reserve certain bit patterns that UTF-16 does to allow for surrogate pairs.
If I write a C# string to an SQL Server nvarchar
(UCS-2) field and read it back, will this always return the same result?
It seems that while UTF-16 is a superset of UCS-2 in the sense that UTF-16 encodes more code points (e.g. above 0xFFFF), it's actually a sub-set of UCS-2 at the 2-byte level, since it's more restrictive.
To answer my own question, I suspect that if my C# string contains code points above 0xFFFF (represented by pairs of characters), these would be stored and retrieved just fine in the database, but if I tried to manipulated them in the database (e.g. perhaps calling TOUPPER or attempting to blank-out every other character), then I could run into some problems displaying the string later... unless SQL Server has functions that acknowledge surrogate pairs and effectively treat nchar/nvarchar
strings as UTF-16.
It's all a bit of a fudge really.
First the similarities
nchar
/nvarchar
/ntext
data types store text as a string of 2-byte characters. It doesn't really care what you put in them until you come to do searching and sorting (then it uses the appropriate Unicode collation sequence).String
data type also stores text as a string of 2-byte Char
s. It also doesn't really care what you put in it until you come to do searching and sorting (then it uses the appropriate culture-specific methods).Now the differences
String
, it will always encode the string as UTF-16 (with full multilingual plane support).In short, as long as you treat both CLR and SQL Server string variables as whole blobs of text, then you can freely assign from one to the other with no loss of information. The underlying storage format is exactly the same, even though the abstractions layered on top are slightly different.