Search code examples
vb.netutf-8ado.netsql-server-2019sqlclient

"Invalid data for UTF8-encoded characters" exception in SQL Server 2019


Environment: VB.NET, VS 2022 (latest build), SQL Server 2019 (latest patch)

I'm attempting to execute a simple INSERT of varchar data containing certain UTF-8 data, and regularly getting the SqlException "Invalid data for UTF8-encoded characters" when a single character is surrounded by left or right curly quotes (e.g. [left double quote]J[right double quote] -- like "J" with curly double quotes around it).

The error also occurs with curly single quotes, but DOES NOT occur with similar data not consisting of single characters not bounded in such a way (e.g. the contraction "isn[apostrophe]t" -- "isn't" with a curly apostrophe).

My code is similar to:

Dim SQL As String = "INSERT INTO MyTable ([MyField]) VALUES (@MyField)"
Dim curName As String

Using cmd As New SqlCommand(SQL, conn)
    cmd.Parameters.Add("@MyField", SqlDbType.VarChar, 80)

    While ...
       ...loop through filereader taking in UTF-8 encoded text data 

       curName = Await oRead.ReadLineAsync()
       cmd.Parameters("@MyField").Value = curName
       Try
          Await cmd.ExecuteNonQueryAsync()
       Catch ex As Exception
          Debug.Print("Error: " & ex.Message)  ' <--- Shows error here
       End Try
  End While
End Using

The column is set to varchar, with the database collation set to Latin1_General_100_CI_AS_SC_UTF8.

Any idea what's going, or if a workaround is possible?


Solution

  • I don’t fully understand the underlying issue — my best guess is that System.Data.SqlClient is old and simply doesn’t support UTF-8 collations — but I found a couple different workarounds:

    1. Declare your SqlParameter as NVarChar instead of VarChar:

      cmd.Parameters.Add("@MyField", SqlDbType.NVarChar, 80)
      

      Now the string will be successfully passed to SQL Server as UCS-2/UTF-16. SQL Server will then automatically convert the string to UTF-8 when it inserts the string into the table.

    2. Upgrade from System.Data.SqlClient to the Microsoft.Data.SqlClient NuGet package. (The Release Notes mention UTF-8 support.) If you go with this option, you must import SQL-related classes like SqlConnection and SqlCommand from the Microsoft.Data.SqlClient namespace instead of System.Data.SqlClient:

      'Imports System.Data.SqlClient
      Imports Microsoft.Data.SqlClient
      

    Analysis

    In the .NET Framework, the internal TdsParser class is responsible for sending queries to SQL Server in binary format. The class contains the following lines of code (2410-2411):

    _defaultCodePage = Int32.Parse(stringCodePage, NumberStyles.Integer, CultureInfo.InvariantCulture);
    _defaultEncoding = System.Text.Encoding.GetEncoding(_defaultCodePage);
    

    This suggests that for all Latin1_General collations, TdsParser encodes strings using code page 1252, which is wrong for UTF-8 collations and produces garbage bytes.