I'm using C# and .NET 3.5, trying to import some data from old DBF files using ODBC with the Microsoft dBase Driver.
The DBFs are in dBase III format and use IBM850 encoding for strings.
When I run my program on my machine, all string data read from the OdbcDataReader comes out already converted to Unicode, and when I save it as UTF-8 everything is fine. But when I run the same program on an XP box, some characters aren't converted correctly to UTF-8: 'Õ', for example, and there may be others. Characters like 'Ä', 'Ö' and 'Ü' are OK. That's the problem; maybe ODBC or the driver uses some machine culture/codepage setting that messes up the conversion.
Is it possible to read the strings from the database as raw binary, maybe with functions like CONVERT or CAST? Or where could I find a reference for the SQL functions and syntax that this dBase driver (or other drivers) supports? I've searched around and couldn't find anything; I feel blind working with ODBC and SQL.
Right now I'm using a temporary hack that replaces all σ's with Õ's.
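A more general form of that hack, assuming the driver on pc2 really is decoding the OEM bytes as codepage 437 instead of 850 (just a sketch; FixCp850 is a made-up helper name), would be to push the string back through CP437 and decode it as CP850:

using System.Text;

// Assumes the broken machine decoded the DBF's OEM bytes with CP437 instead of CP850,
// so we recover the original bytes via CP437 and re-decode them with CP850.
static string FixCp850(string s)
{
    byte[] oem = Encoding.GetEncoding(437).GetBytes(s);
    return Encoding.GetEncoding(850).GetString(oem);
}

This only round-trips characters that actually exist in CP437; anything outside it would come back as '?'.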
Thanks!
Example code:
using System.IO;
using System.Text;

System.Data.Odbc.OdbcConnection oConn = new System.Data.Odbc.OdbcConnection();
oConn.ConnectionString = @"Driver={Microsoft dBase Driver (*.dbf)};DriverID=277;Dbq=" + dbPath + ";";
oConn.Open();
System.Data.Odbc.OdbcCommand oCmd = oConn.CreateCommand();
oCmd.CommandText = @"SELECT name FROM " + dbPath + "TABLE.DBF";
System.Data.Odbc.OdbcDataReader reader = oCmd.ExecuteReader();
reader.Read();

// By this point the driver has already converted the field to a Unicode string
byte[] buf = Encoding.UTF8.GetBytes(reader.GetString(0));
using (BinaryWriter writer = new BinaryWriter(File.Open(@"C:\DBF\Test.txt", FileMode.Create)))
{
    writer.Write(buf);
}
Result:
0xE5 in the DBF (Õ in CP850)
Test.txt on pc1: C3 95 (Õ in UTF-8)
Test.txt on pc2: CF 83 (σ in UTF-8)
If you are still having a problem with these files, I may be able to help you.
What is in the "codepage byte" aka "language driver id" (LDID) at offset 29 (decimal) in the file?
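A quick way to check that byte from C# (reusing the dbPath variable from the example above; needs System and System.IO):

using (FileStream fs = File.OpenRead(dbPath + "TABLE.DBF"))
{
    fs.Seek(29, SeekOrigin.Begin);                     // offset 29 = language driver ID
    Console.WriteLine("LDID: 0x{0:X2}", fs.ReadByte());
}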
I have a Python-based DBF reader which can read just about any field data type and just about any codepage; it has a long list, compiled from various sources, of mappings from the codepage byte to a codepage number. The options are: (1) believe the LDID and deliver Unicode; (2) ignore the LDID and deliver undecoded bytes; (3) override the LDID and decode with a specific codepage into Unicode. The Unicode can of course then be encoded as UTF-8.
The DBF reader also does a whole lot of reasonableness cross-checks, which may help in investigating why VFP thinks the file is corrupt.
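For what it's worth, the core of option (3), reading the character data yourself and decoding it with a codepage you choose, can be sketched in C# as well. This is only an illustration: the DumpDbf name is made up, it assumes a plain dBase III file without memo fields, and it does none of the cross-checks mentioned above.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Rough sketch: read dBase III records directly and decode every field with a chosen
// codepage (e.g. 850), bypassing whatever conversion the ODBC driver applies.
static void DumpDbf(string path, int codepage)
{
    Encoding enc = Encoding.GetEncoding(codepage);
    byte[] data = File.ReadAllBytes(path);

    int recordCount = BitConverter.ToInt32(data, 4);   // offset 4: number of records
    int headerLen = BitConverter.ToUInt16(data, 8);    // offset 8: header length
    int recordLen = BitConverter.ToUInt16(data, 10);   // offset 10: record length

    // Field descriptors: 32-byte entries starting at offset 32, terminated by 0x0D.
    List<KeyValuePair<string, int>> fields = new List<KeyValuePair<string, int>>();
    for (int p = 32; data[p] != 0x0D; p += 32)
    {
        string name = Encoding.ASCII.GetString(data, p, 11).TrimEnd('\0');
        fields.Add(new KeyValuePair<string, int>(name, data[p + 16])); // byte 16 = field length
    }

    for (int r = 0; r < recordCount; r++)
    {
        int pos = headerLen + r * recordLen;
        if (data[pos] == 0x2A) continue;                // '*' marks a deleted record
        pos++;                                          // skip the deletion flag
        foreach (KeyValuePair<string, int> f in fields)
        {
            Console.WriteLine("{0} = {1}", f.Key, enc.GetString(data, pos, f.Value).Trim());
            pos += f.Value;
        }
    }
}

Calling DumpDbf(dbPath + "TABLE.DBF", 850) should print the same data you got through ODBC, but decoded explicitly with CP850.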
How do you know that it's using IBM850? Another piece of Python code I have is a prototype encoding detector which, unlike detectors such as 'chardet' that are derived from the Mozilla code, is not web-centric and can happily recognise most old DOS codepages; this may help.
An observation: the Greek lowercase sigma (σ) is 0xE5 in codepage 437, which was succeeded by codepage 850; "pc2" seems a little outdated ...
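A quick demonstration of that in C# (decoding the same byte with both codepages):

byte[] raw = new byte[] { 0xE5 };
Console.WriteLine(Encoding.GetEncoding(850).GetString(raw));  // Õ
Console.WriteLine(Encoding.GetEncoding(437).GetString(raw));  // σ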
If you think I can be of any help, feel free to e-mail me at insert_punctuation("sjmachin", "lexicon", "net")