Search code examples
excelvbaadodbvba7vba6

Fix BOM issues when reading UTF-8 encoded CSVs with VBA ()


I'd like to get fresh advice on the infamous issue caused by the Byte Order Mark ( or EF BB BF in hex) when trying to read UTF-8 encoded CSVs with VBA (Excel). Note that I'd like to avoid opening the CSV with Workbooks.Open or FileSystemObject. Actually, I'd rather use an adodb.RecordSet as I need to execute some kind of SQL queries.

After having read many (many!) things, I believe the 4 best solutions to deal with this specific issue are:

  • Removing the BOM before reading the CSV with ADODB.Connection / ADODB.RecordSet (for instance, via #iFile or Scripting.FileSystemObject-OpenAsTextStream to efficiently read the first line of the file and remove the BOM).
  • Creating a schema.ini file so that ADO properly parses the CSV.
  • Using some modules created by wizards (like W. Garcia's class module).
  • Using an ADODB.Stream and setting Charset = "UTF-8".

The last solution (using a stream) seems quite fine but doing the following returns a string:

Sub loadCsv()

    Const adModeReadWrite As Integer = 3

    With CreateObject("ADODB.Stream")
        .Charset = "utf-8"
        .Mode = adModeReadWrite
        .Open
        .LoadFromFile ("C:\atestpath\test.csv")
        Debug.Print .readtext
    End With
 
End Sub

Do you know any trick that could help using the string returned by .readtext as the Data Source of an ADODB.RecordSet or ADODB.Connection (apart from looping to manually populate the fields of my recordset)?


Solution

  • EDIT: I found that loading the CSV with a querytable object (see this good example) or through a WorkbookQuery object (introduced in Excel 2016) are the easiest and probably most reliable ways to proceed (see an example from the documentation here).

    OLD ANSWER:

    Talking with @Profex encouraged me to further investigate the issue. Turns out there are 2 problems: the BOM and the delimiter used for the CSV. The ADO connection string I need to use is :

    strCon = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\test\;Extended Properties='text;HDR=YES;CharacterSet=65001;FMT=Delimited(;)'"
    

    But FMT does not work with a semicolon (FMT=Delimited(;)), at least with Microsoft.ACE.OLEDB.12.0 on a x64 system (Excel x64). Thus, @Profex was quite right to state:

    even though the first field name has a ? in front of it, it doesn't look like it actually matters

    given that he was using FMT=Delimited on a CSV delimited by a simple comma (",").

    Some people suggest to edit the registry so that the semicolon delimiter is accepted. I'd like to avoid that. Also, I'd rather not create a schema.ini file (even if that may be the best solution for complex CSVs). Thus, the only solutions remaining require to edit the CSV before creating the ADODB.Connection.

    I know my CSV will always have the problematical BOM as well as the same basic structure (something like "date";"count"). Thus I decided to go with this code:

    Dim arrByte() As Byte
    Dim strFilename As String
    Dim iFile As Integer
    Dim strBuffer As String
    strFilename = "C:\Users\test\t1.csv"
    If Dir(strFilename) <> "" Then 'check if the file exists, because if not, it would be created when it is opened for Binary mode.
        iFile = FreeFile
        Open strFilename For Binary Access Read Write As #iFile
        strBuffer = String(3, " ") 'We know the BOM has a length of 3
        Get #iFile, , strBuffer
        If strBuffer = "" 'Check if the BOM is there
            strBuffer = String(LOF(iFile) - 3, " ")
            Get #iFile, , strBuffer 'the current read position is ok because we already used a Get. We store the whole content of the file without the BOM in strBuffer
            arrByte = Replace(strBuffer, ";", ",") 'We replace every semicolon by a colon
            Put #iFile, 1, arrByte
        End If
        Close #iFile
    End If
    

    (note: one might use arrByte = StrConv(Replace(strBuffer, ";", ","), vbFromUnicode) because the bytes array is in ANSI format).