Tags: ms-access, utf-8, import, byte-order-mark

MS Access Import UTF-8 file with BOM causes issue if the next char is double quote?


I found a strange issue, I'm wondering if anyone knows either why this happens or if it is a known bug (yes I Googled).

I have a database where I have to import CSV files using Access. I'm making some changes to the project to allow UNICODE characters to be stored in the database. There is some preprocessing involved in VBA, reading and writing the files using ADODB.Stream.

ADODB.Stream is the only method of processing UNICODE files that I found to be usable. However, I'm writing UTF-8 files with it, and obviously, as it should, it writes a BOM (Byte Order Mark, the bytes EF BB BF) at the start of the file.
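For reference, a minimal sketch of the kind of ADODB.Stream write involved (the file path and content are placeholders; the constants 2 stand for adTypeText and adSaveCreateOverWrite):

```vba
' Sketch: writing text as UTF-8 with ADODB.Stream.
' For the "utf-8" charset, ADODB.Stream prepends the BOM bytes EF BB BF on save.
Dim stm As Object
Set stm = CreateObject("ADODB.Stream")
stm.Type = 2                            ' adTypeText
stm.Charset = "utf-8"
stm.Open
stm.WriteText """Header 1"";""Header 2""" & vbCrLf
stm.SaveToFile "C:\temp\out.csv", 2     ' adSaveCreateOverWrite
stm.Close
```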

So far so good, Microsoft, way to support standards! Now when this file is saved, I want to import it using an ImportExportSpecification. I explicitly specify Microsoft's Codepage 65001, which is what the rest of the world calls UTF-8. Then I try to import and, no surprise, it works... until your file starts with a double quote, at which point Access complains that record 2 (2??) could not be parsed.
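The import itself is just a one-liner once the specification exists (the spec name below is a placeholder; the codepage lives inside the saved specification's XML, not in the call):

```vba
' Sketch: running a saved import specification (Access 2007+).
' "MyUtf8Import" is a hypothetical spec name; its XML carries CodePage 65001.
CurrentProject.ImportExportSpecifications("MyUtf8Import").Execute
```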

The unfortunate thing is that I cannot control the source of these files, and I'm not involved in the actual imports; it's automated. So now I either have to remove the BOM, which I don't really want to do, because it means reading and writing the entire file to disk or in memory and these files are insanely huge. Or I remove the double quote, which is actually necessary for the headers of the file (which I can't control either).

TL;DR: It appears that if your file starts with " after the BOM, you break Access's ImportExport module. Is it a bug, a "feature", or why the .. does it happen?

OK, so I will remove the BOM, so Christmas doesn't have to start without me. But it's not a solution, it's a band-aid. I would very much like to know how to fix this properly; it's itching me.. :)

Merry Christmas!


Solution

  • As Guntram Blohm pointed out:

    Maybe the Access import module reads the BOM, decides on UTF-16, and falls back to UTF-8 if the next 2 bytes are not a valid codepoint. This seems to be the case. It may be a bug, or it may be that UTF-8 with BOM was never fully implemented; I don't know. If you do, feel free to drop a comment, I'm still curious.

    Anyway, there are 3 ways you can "solve" the issue:

    1. Remove the BOM (rewriting the entire file; this could be done in memory rather than reading and writing from disk), see here: http://axlr8r.blogspot.nl/2011/05/how-to-export-data-into-utf-8-without.html

    2. Remove the double quote, or place an extra character after the BOM if you can.

    3. Save the file as UTF-16 which Access seems to have better support for. Compared to an ANSI file a UTF-16 file will be twice the size. Whereas UTF-8 would only add extra bytes per character when special characters are met. In my case the file is in an intermediary state, I'm not keeping it, after importing I delete the file. So I chose to save the file as UTF-16 and import it with Codepage 1200 (which is UTF-16 for the rest of the world).