In Excel VBA, I'm trying to read a CSV text file using ACE OLEDB Engine plus a schema.ini file. This works, but when there is an error in the data I expected the ACE OLEDB Engine to throw an exception.
However, it just reads error values in the data as null, passing them as empty values, and it simply continues without any error messages. Especially with large datasets this is unworkable, because there's just no way to easily spot any errors by hand.
Here is a tab separated data set, with some intentional errors to reproduce it.
OrderID OrderDate Amount Price ArtNo
40361 07/09/2019 2 59.95 4006633374668
40362 07/10/2019 2 42,95 4006633146647
40363 07/11/2019 1 69.75 4016987119501
40364 07/12/2019 4 4.99 4082300264630
40365 07/13/2019 1 39.95 4026736081348
40366 07/14/2019 1 230.00 4013872786831
40367 07/29/2019 1 42.95 4006633294256
40368 07/30/2019 1 3,299.00 GTIN:0191215072422
40369 07/31/2019 1 86.95 4010858791506
40370 07/32/2019 10 8.99 4029416288686
And the corresponding schema.ini file
[testdata.txt]
DateTimeFormat=m/d/yyyy
DecimalSymbol=.
Format=TabDelimited
Col1=OrderID Integer Width 5
Col2=OrderDate Date Width 10
Col3=Amount Integer Width 2
Col4=Price Float Width 10
Col5=ArtNo Text Width 13
After connecting to the text data file with ACE OLEDB and doing SELECT * FROM textfile
the result is this:
Some of the cells are empty, even though there are values in the data file.
B: value "42,95" is an error the DecimalSymbol
is set to point.
D: value "3,299.00" is an error because of the thousand comma
E: value "07/32/2019" is an incorrect date
F*: value "GTIN:0191215072422" too long, E9 is cut off at 13 chars (*not labeled in image)
(btw also the date values A are interpreted incorrectly, even though the setting DateTimeFormat=m/d/yyyy
is correct, yet C is correct probably just because the day is higher than 12, but Excel is always fidgety with dateformats so I'll just ignore those here)
Below is the VBA code to load a text data file using the ACE OLEDB Engine.
Sub ReadCsvTest(sPath As String, sFile As String, sBOM As String)
Dim Wsh As Worksheet
Dim AdoConnect As ADODB.Connection ' Tools > References > select "Microsoft ActiveX Data Objects x.x Library"
Dim AdoRcrdSet As ADODB.Recordset
Dim strSQL As String
Dim sField As String
Dim sValue As String
Dim i As Integer
Dim iRow As Integer
' not sure what it does, but just set sBOM = "\xEF\xBB\xBF"
' Add a new sheet
With ActiveWorkbook
Set Wsh = .Sheets.Add(After:=.Sheets(.Sheets.Count))
'Wsh.Name = NewSheetName 'rename new Sheet
End With
Set AdoConnect = New ADODB.Connection
AdoConnect.Provider = "Microsoft.ACE.OLEDB.12.0"
AdoConnect.ConnectionString = "Data Source=" & sPath & ";Extended Properties='text';"
AdoConnect.Open
strSQL = "select * from " & sFile
Set AdoRcrdSet = New ADODB.Recordset
AdoRcrdSet.Open strSQL, AdoConnect
' header column names
For i = 0 To AdoRcrdSet.Fields.Count - 1
sField = WorksheetFunction.Substitute(AdoRcrdSet.Fields(i).Name, sBOM, "")
Wsh.Cells(1, i + 1).Value = sField
Next
' row values
iRow = 2
AdoRcrdSet.MoveFirst
While Not AdoRcrdSet.EOF
For i = 0 To AdoRcrdSet.Fields.Count - 1
sValue = IIf(IsNull(AdoRcrdSet.Fields(i).Value), "", AdoRcrdSet.Fields(i).Value)
Wsh.Cells(iRow, 1 + i).Value = sValue
Next
AdoRcrdSet.MoveNext
iRow = iRow + 1
Wend
End Sub
Btw I understand that I can also use the Open file dialog and manually set the datatype and format for each column, but that ignores the schema.ini completely. And also, for other puposes, it is a requirement to use the ACE OLEDB Engine.
So my question is, is it possible to detect these type of data errors? Can I somehow make a distinction between data errors and actually empty/null values? Or can I use an extra setting in the connection string to make the ACE OLEDB Engine throw errors, or make it set set error values to something else than null?
For anyone interested in this question, I ended up creating a CSV Lint plug-in for Notepad++.
The plug-in can automatically detect the type of csv (comma, tab etc.) and the data types of the columns. This datadefinition is based on schema.ini format. The plug-in can then validate the data against this metadata.
The automatic column detection works most of the time, it doesn't always detect the datatypes correctly. However you can manually edit the metadata before validating the data file, so that shouldn't be a problem. It also works with fixed width datafiles and it includes an option for syntax highlighting as you can see in the screenshot.
I hope to update this Notepad++ plug-in with additional features, but it works great for detecting data errors in csv files.
You can get the plug-in here:
http://github.com/BdR76/CSVLint