Tags: c#, sql-server-ce

Why is Sql CE altering my unicode value?


I'm ingesting unstructured text of various documents from various sources. I'm using Sql Server Compact Edition 3.5.

I'm creating a lookup table of unique words, referencing each word by an ID from an identity column. The problem I'm running into seems to be Unicode-related.

Upon inserting the term "deﬁnitions" I'm hitting the unique key constraint against a row that already contains "definitions". If you inspect the two closely, you'll find that the value I'm inserting does not contain an 'f' followed by an 'i', but a single character: the 'ﬁ' ligature (U+FB01). Yet SqlCe treats it as if it were an 'f' and an 'i'. One word is ten characters, the other eleven, but SqlCe sees them as the same.

The table column is nvarchar.

I specified the parameter as nvarchar.

The query is very straightforward:

            cmd.CommandText = "INSERT INTO dictionary(lemma) VALUES(?);";

            DbParameter lemma = cmd.CreateParameter();
            cmd.Parameters.Add(lemma);

            for (int i = 0; i < terms.Count; i++)
            {
                lemma.Value = terms[i].Key;
                cmd.ExecuteNonQuery();
            }

I've also tried:

            cmd.CommandText = "INSERT INTO dictionary(lemma) VALUES(?);";

            SqlCeParameter lemma = new SqlCeParameter("lemma", SqlDbType.NVarChar);
            cmd.Parameters.Add(lemma);

            for (int i = 0; i < terms.Count; i++)
            {
                lemma.Value = terms[i].Key;
                cmd.ExecuteNonQuery();
            }

In the value I'm inserting, the bytes for the 'ﬁ' ligature are 1 251 (0xFB01 in little-endian UTF-16), as opposed to 'f' and 'i', whose bytes are 102 0 and 105 0. See the following:

 {byte[20]}
 [0]: 100
 [1]: 0
 [2]: 101
 [3]: 0
 [4]: 1
 [5]: 251
 [6]: 110
 [7]: 0
 [8]: 105
 [9]: 0
 [10]: 116
 [11]: 0
 [12]: 105
 [13]: 0
 [14]: 111
 [15]: 0
 [16]: 110
 [17]: 0
 [18]: 115
 [19]: 0

Whereas the value in the database (the one SqlCe is seeing as a violation of a unique key) is:

{byte[22]}
[0]: 100
[1]: 0
[2]: 101
[3]: 0
[4]: 102
[5]: 0
[6]: 105
[7]: 0
[8]: 110
[9]: 0
[10]: 105
[11]: 0
[12]: 116
[13]: 0
[14]: 105
[15]: 0
[16]: 111
[17]: 0
[18]: 110
[19]: 0
[20]: 115
[21]: 0

How can I get SQL Server CE to insert the value correctly?

EDIT: Corrected the code shown above.


Solution

  • SQL CE is not altering any of your values. The problem is that you have a unique key on the column, and SQL CE determines uniqueness of NVARCHAR values by using collation rules. The default is to match strings in a culturally aware way, so 'ﬁ' = 'fi', 'Å' = 'Å', and many others.

    I do not know of a SQL CE collation that treats each Unicode code point as distinct. If that is really what you want, you will have to store your data as VARBINARY and perform binary comparisons.
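As a runnable sketch of what binary comparison buys you, the HashSet below with a byte-wise comparer stands in for a UNIQUE constraint on a VARBINARY column (the comparer class is illustrative, not a SQL CE API):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

class BinaryUniqueness
{
    // Byte-for-byte equality, mimicking a UNIQUE constraint on a VARBINARY column.
    // (Illustrative helper; not part of any SQL CE API.)
    class ByteArrayComparer : IEqualityComparer<byte[]>
    {
        public bool Equals(byte[] x, byte[] y) => x.SequenceEqual(y);
        public int GetHashCode(byte[] b) => b.Aggregate(17, (h, v) => unchecked(h * 31 + v));
    }

    static void Main()
    {
        var unique = new HashSet<byte[]>(new ByteArrayComparer());

        // UTF-16LE bytes, the same representation SQL CE stores for NVARCHAR.
        bool first = unique.Add(Encoding.Unicode.GetBytes("definitions"));      // 'f' + 'i'
        bool second = unique.Add(Encoding.Unicode.GetBytes("de\uFB01nitions")); // U+FB01

        Console.WriteLine(first);  // True
        Console.WriteLine(second); // True: binary comparison keeps both words
    }
}
```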

    Before you do, reconsider your desire to compare characters as binary: SQL defines VARCHAR as human-readable text, and both SQL and Unicode specify that corresponding ligatures, diacritics, et al. match as strings. This makes sense; people really do read them the same, and they are indistinguishable in most fonts.
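If, on the other hand, you want the two spellings to collapse into one word on purpose, one option (a preprocessing step in your own pipeline, not something SQL CE does for you) is to fold compatibility characters with NFKC normalization before inserting:

```csharp
using System;
using System.Text;

class NormalizeDemo
{
    static void Main()
    {
        string ligature = "de\uFB01nitions"; // contains U+FB01

        // NFKC (compatibility) normalization decomposes the ligature into 'f' + 'i',
        // so both spellings become byte-identical before they reach the database.
        string folded = ligature.Normalize(NormalizationForm.FormKC);

        Console.WriteLine(folded == "definitions"); // True
        Console.WriteLine(folded.Length);           // 11
    }
}
```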