This is a purely theoretical question, just to wrap my head around it.
Let's say I have the Unicode cyclone symbol (🌀, U+1F300). If I try to store it in a varchar column that has the default Latin1_General_CI_AS collation, the cyclone symbol cannot fit into the single byte that varchar uses per character...
The ways I can see this done:
I have done some research: I inserted a couple of different Unicode symbols
INSERT INTO [Table] (Field1)
VALUES ('👽')

INSERT INTO [Table] (Field1)
VALUES ('🌀')

and then read them back as bytes:

SELECT CAST(Field1 AS varbinary(10))
FROM [Table]
In both cases I got 0x3F3F.
0x3F in ASCII is ? (a question mark), i.e. the two question marks (??) that I also see when doing a plain SELECT *.
Does that mean the data is toast and not even the first byte of the symbol is being stored?
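For reference, a self-contained sketch of that experiment (the table definition is my own assumption; any varchar column with this collation behaves the same way):

-- Hypothetical table; the column just needs to be varchar with a single-byte-code-page collation.
CREATE TABLE [Table]
(
    Field1 varchar(10) COLLATE Latin1_General_CI_AS
);

INSERT INTO [Table] (Field1) VALUES ('👽');
INSERT INTO [Table] (Field1) VALUES ('🌀');

-- Both rows come back as 0x3F3F, i.e. two '?' characters.
SELECT Field1, CAST(Field1 AS varbinary(10)) AS RawBytes
FROM [Table];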
How is Unicode data that is out of collation stored in varchar column?
The data is toast and is exactly what you see: two 0x3F bytes. This happens during the type conversion prior to the insert and is effectively the same as cast('👽' as varbinary(2)), which is also 0x3F3F (as opposed to casting N'👽'). You get two bytes because the emoji is outside the Basic Multilingual Plane: it is encoded as a UTF-16 surrogate pair, and each of the two code units is replaced with its own '?'.
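To make the contrast concrete, here is a small sketch (the results assume the database's default collation uses a single-byte code page, e.g. Latin1_General_CI_AS):

-- Non-Unicode literal: converted to the collation's code page first, so the emoji is already '??'.
SELECT CAST('👽' AS varbinary(2));   -- 0x3F3F

-- Unicode literal: stays UTF-16, so the surrogate pair for U+1F47D is preserved.
SELECT CAST(N'👽' AS varbinary(4));  -- 0x3DD87DDC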
When Unicode data must be inserted into non-Unicode columns, the columns are internally converted from Unicode by using the WideCharToMultiByte API and the code page associated with the collation. If a character cannot be represented on the given code page, the character is replaced by a question mark (?). Ref.
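A sketch of that rule in action (assuming the database default collation is Latin1_General_CI_AS, i.e. code page 1252):

-- 'é' (U+00E9) exists in code page 1252, so it survives the round trip to varchar.
SELECT CAST(CAST(N'é' AS varchar(10)) AS varbinary(10));   -- 0xE9

-- U+1F47D has no code page 1252 mapping, so each half of the surrogate pair becomes '?'.
SELECT CAST(CAST(N'👽' AS varchar(10)) AS varbinary(10));  -- 0x3F3F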