Search code examples
csvvalidationerror-handlingoledb

Read csv data with ACE OLEDB Engine, how to detect errors in data


In Excel VBA, I'm trying to read a CSV text file using ACE OLEDB Engine plus a schema.ini file. This works, but when there is an error in the data I expected the ACE OLEDB Engine to throw an exception.

However, it just reads error values in the data as null, passing them as empty values, and it simply continues without any error messages. Especially with large datasets this is unworkable, because there's just no way to easily spot any errors by hand.

Here is a tab separated data set, with some intentional errors to reproduce it.

OrderID OrderDate   Amount  Price   ArtNo   
40361   07/09/2019  2   59.95   4006633374668   
40362   07/10/2019  2   42,95   4006633146647   
40363   07/11/2019  1   69.75   4016987119501   
40364   07/12/2019  4   4.99    4082300264630   
40365   07/13/2019  1   39.95   4026736081348   
40366   07/14/2019  1   230.00  4013872786831   
40367   07/29/2019  1   42.95   4006633294256   
40368   07/30/2019  1   3,299.00    GTIN:0191215072422  
40369   07/31/2019  1   86.95   4010858791506   
40370   07/32/2019  10  8.99    4029416288686   

And the corresponding schema.ini file

[testdata.txt]
DateTimeFormat=m/d/yyyy
DecimalSymbol=.
Format=TabDelimited
Col1=OrderID Integer Width 5
Col2=OrderDate Date Width 10
Col3=Amount Integer Width 2
Col4=Price Float Width 10
Col5=ArtNo Text Width 13

After connecting to the text data file with ACE OLEDB and doing SELECT * FROM textfile the result is this:
Excel ACE OLEDB Engine errors are just set to null

Some of the cells are empty, even though there are values in the data file.
B: value "42,95" is an error the DecimalSymbol is set to point.
D: value "3,299.00" is an error because of the thousand comma
E: value "07/32/2019" is an incorrect date
F*: value "GTIN:0191215072422" too long, E9 is cut off at 13 chars (*not labeled in image)

(btw also the date values A are interpreted incorrectly, even though the setting DateTimeFormat=m/d/yyyy is correct, yet C is correct probably just because the day is higher than 12, but Excel is always fidgety with dateformats so I'll just ignore those here)

Below is the VBA code to load a text data file using the ACE OLEDB Engine.

Sub ReadCsvTest(sPath As String, sFile As String, sBOM As String)

    Dim Wsh As Worksheet
    Dim AdoConnect As ADODB.Connection ' Tools > References > select "Microsoft ActiveX Data Objects x.x Library"
    Dim AdoRcrdSet As ADODB.Recordset
    Dim strSQL As String
    Dim sField As String
    Dim sValue As String
    Dim i As Integer
    Dim iRow As Integer
    ' not sure what it does, but just set sBOM = "\xEF\xBB\xBF"

    ' Add a new sheet
    With ActiveWorkbook
        Set Wsh = .Sheets.Add(After:=.Sheets(.Sheets.Count))
        'Wsh.Name = NewSheetName        'rename new Sheet
    End With

    Set AdoConnect = New ADODB.Connection
    AdoConnect.Provider = "Microsoft.ACE.OLEDB.12.0"
    AdoConnect.ConnectionString = "Data Source=" & sPath & ";Extended Properties='text';"
    AdoConnect.Open

    strSQL = "select * from " & sFile
    Set AdoRcrdSet = New ADODB.Recordset
    AdoRcrdSet.Open strSQL, AdoConnect

    ' header column names
    For i = 0 To AdoRcrdSet.Fields.Count - 1
        sField = WorksheetFunction.Substitute(AdoRcrdSet.Fields(i).Name, sBOM, "")
        Wsh.Cells(1, i + 1).Value = sField
    Next

    ' row values
    iRow = 2
    AdoRcrdSet.MoveFirst
    While Not AdoRcrdSet.EOF
        For i = 0 To AdoRcrdSet.Fields.Count - 1
            sValue = IIf(IsNull(AdoRcrdSet.Fields(i).Value), "", AdoRcrdSet.Fields(i).Value)
            Wsh.Cells(iRow, 1 + i).Value = sValue
        Next

        AdoRcrdSet.MoveNext
        iRow = iRow + 1
    Wend

End Sub

Btw I understand that I can also use the Open file dialog and manually set the datatype and format for each column, but that ignores the schema.ini completely. And also, for other puposes, it is a requirement to use the ACE OLEDB Engine.

So my question is, is it possible to detect these type of data errors? Can I somehow make a distinction between data errors and actually empty/null values? Or can I use an extra setting in the connection string to make the ACE OLEDB Engine throw errors, or make it set set error values to something else than null?


Solution

  • For anyone interested in this question, I ended up creating a CSV Lint plug-in for Notepad++.

    The plug-in can automatically detect the type of csv (comma, tab etc.) and the data types of the columns. This datadefinition is based on schema.ini format. The plug-in can then validate the data against this metadata.

    enter image description here

    The automatic column detection works most of the time, it doesn't always detect the datatypes correctly. However you can manually edit the metadata before validating the data file, so that shouldn't be a problem. It also works with fixed width datafiles and it includes an option for syntax highlighting as you can see in the screenshot.

    I hope to update this Notepad++ plug-in with additional features, but it works great for detecting data errors in csv files.

    You can get the plug-in here:
    http://github.com/BdR76/CSVLint