Search code examples
unicodeencodingoledbinformix

Encoding problem when migrating a legacy tool to a new Windows version with OleDB


We have an odd problem after migrating an old application (>15 yo) written in c# to a new Windows Server. The application uses OleDB to connect to the database which is an Informix database. This database has a table with texts in multiple languages. The application running in a Windows 2003 server works just fine, however in the new Windows 2016 it raises the error: "The data value could not be converted for reasons other than sign mismatch or data overflow. For example, the data was corrupted in the data store but the row was still retrievable."

After some investigation we have found the problem to be in a string that has some unicode characters in it.

This is the part of the text that is generating the problem (only part of the text to illustrate the problem:

"17"-Leichtmetallräder ...... Ziffern - Schaltknauf"

This is a German text and seems ok, the problem is actually with the "-". Looking at the db record in Hex, the first "-" is coded as "3F", however the second dash is coded as "C296", which corresponds to U+0096 (a dash in unicode)

The settings for the DB is en_US.819 (which corresponds to ISO-8859-1 to support for all languages that need to be supported).

Now, the problem is that when running the program in Windows 2003 the result is written in a file correctly like:

"17"-Leichtmetallräder ...... Ziffern - Schaltknauf"

However in Windows 2016 the exception above is raised and nothing gets written.

I worked on some code changes, the first thing I did was to change OleDB for Odbc connection and the exception dissapeared, however the text in the output is incorrect:

"17"-Leichtmetallräder ...... Ziffern ? Schaltknauf"

Notice how the same code with odbc connection is unable to understand the unicode dash.

This is the OleDB code that works in Windows 2003:


    OleDbConnection ConnOleDbIDD = new OleDbConnection("Provider=Ifxoledbc.2;Data Source=db;INFORMIXSERVER=localhost;IFMX_UNDOC_B168163=1;");
    string sConnectTemplateDB = "Data Source=SQLServerDB;Initial Catalog=DB1; Connect Timeout = 28800; Integrated Security=True";

    ConnOleDbIDD.Open();

    sExportSQL = "SELECT * From MyTable";

    OleDbCommand cmdIDD = new OleDbCommand(sExportSQL, ConnOleDbIDD);
    cmdIDD.CommandTimeout = 28800;

    SqlDataAdapter da;
    ConnSchemaIDD = new SqlConnection (sConnectTemplateDB);
    ConnSchemaIDD.Open();
    SqlCommand cmdSQLServerTemplate = new SqlCommand(sExportSQL.Replace("TRIM","LTRIM"), ConnSchemaIDD);
    cmdSQLServerTemplate.CommandTimeout = 28800;
    da = new SqlDataAdapter(cmdSQLServerTemplate);


    OleDbDataReader dr;
    DataSet ds = new DataSet();
    da.MissingSchemaAction = MissingSchemaAction.AddWithKey;
    da.Fill(ds, sSourceTable);
    DataTable dt = ds.Tables[sSourceTable];
    dr = cmdIDD.ExecuteReader()

    iEnCodingFrom = 1252;
    iEnCodingTo = 1252;

    while (dr.Read())
    {
        sValue = "";
        sCurrentValue = "";
        bDelimiterPosition = false;
        foreach (DataColumn cCol in dt.Columns)
        {

            object oval = dr.GetValue(dr.GetOrdinal(cCol.ColumnName));
            string val = Convert.ToString(dr[cCol.ColumnName]);
            sCurrentValue = System.Text.Encoding.GetEncoding(iEnCodingTo).GetString(System.Text.Encoding.Convert(System.Text.Encoding.GetEncoding(iEnCodingFrom), System.Text.Encoding.GetEncoding(iEnCodingTo), System.Text.Encoding.GetEncoding(iEnCodingFrom).GetBytes(val)));

            if (bDelimiterPosition == true)
            {
                sValue = sValue + sDelimiter + sCurrentValue.Trim();
            }
            else
            {
                sValue = sValue + sCurrentValue.Trim();
            }

            bDelimiterPosition = true;

        }

        w.WriteLine(sValue);
        w.Flush();

    }
    dr.Close();

Assume for this example that "Mytable" has 2 columns, the first is an integer ID and the second is a char(3100).

As you see the code does some weird things like getting the column description from an schema of the table in a SQLServer database, and a conversion of the db output from CP1252 to CP1252. I am not sure why it was coded that way. My workaround for this problem has been doing these changes to the code (using odbc connection instead of oledb):


    iEnCodingFrom = 28591;
    ...
       sCurrentValue = Encoding.GetEncoding(iEnCodingTo).GetString(Encoding.GetEncoding(iEnCodingFrom).GetBytes(val.ToCharArray()));
    ...

So changing the connection to an ODBC conection to the informix DB which prevents the exception to be raised, and doing a conversion from codepage 28591 (8859-1) to 1252 (CP1252) produces in Windows 2016 the same result as the old code in Windows 2013.

So I have a workaround and can use it, however I would like to understand why this happens, why can't I keep using OleDB and if there is a way I can make it work in a new Windows environment (fails also in windows 10) without having to change the code.

Any help would be greatly appreciated.

Thank you


Solution

  • Thanks @LuísMarques and @jsagrera This is exactly the explanation I was looking for, so now I can understand the problem. In the article it is said that:

    "Since CSDK version 2.80, the ODBC driver is Unicode enabled, this means all the data the driver handles has to be in Unicode format. This means that a extra conversion has to be done".

    The version of the csdk in the old server is 2.71. The version in the new server is 4.10.

    Now, the "UNDOC" was there for that reason, the DB was created using en_us.819 but that was ignored with the "undoc" variable for my client app, which assumes the data comes in CP1252 and print it out in CP1252, without any internal conversion the program worked.

    But the data in the DB is corrupted anyway. After upgrading the driver that internal conversion being made produces the error.

    I still can work around it, I don't use the "UNDOC" in the ODBC connection, then I get the stream of bytes from the DB and do a conversion from 8859-1 to CP1252 in my C# code. That way I get exactly the same output as in the old server.

    However this is not a correct solution but a mitigation of the problem, and the final solution will be to change the DB to UTF8 to avoid any more problems. And that's what we will finally do.

    Thank you @jsagrera I'd like to mark your answer as the correct one. I'm new to the platform so I don't know well how it works. If you would post your comment as an answer I'd gladly upvote it of mark it as the correct if possible.