Search code examples
xmlvbautf-16byte-order-mark

VBA Output to file using UTF-16


I have a very complex problem that is difficult to explain properly. There is LOTS of discussion about this across the internet, but nothing definitive. Any help, or better explanation than mine, is greatly appreciated.

Essentially, I'm just trying to write an XML file using UTF-16 with VBA.

If I do this:

sXML = "<?xml version='1.0' encoding='utf-8'?>"
sXML = sXML & rest_of_xml_document
Print #iFile, sXML

then I get a file that is valid XML. However, if I change the "encoding=" to "utf-16", I get this error from my XML validator:

Switch from current encoding to specified encoding not supported.

Googling tells me that this means the xml encoding attribute is different to the ACTUAL encoding used by the file, hence I must be creating a utf-8 document via Open and Print commands.

If I do something like:

With CreateObject("ADODB.Stream")
  .Type = 2
  .Charset = "utf-16"
  .Open
  .WriteText sXML
  .SaveToFile sFilename, 2
  .Close
End With

then I end up with some funky characters (the BOM) at the beginning of my file which causes it to fail XML validation.

If I open the file in Notepad++, delete the BOM and change the Encoding to "UCS-2", then the file validates fine with a "utf-16" encoding value (meaning that UCS-2 is close enough to UTF-16 that it doesnt matter, or that XML is able to Switch from current encoding between these two types.

I need to use UTF-16 because UTF-8 doesn't cover all the characters used in the presentations I'm exporting.

The question:

How can I get VBA to behave like Notepad++, creating a UTF-16-encoded text file without a BOM that can be filled with XML data? ANY help much appreciated!


Solution

  • Your point about UTF-8 not being able to store all characters you need is invalid.
    UTF-8 is able to store every character defined in the Unicode standard.
    The only difference is that, for text in certain languages, UTF-8 can take more space to store its codepoints than, say, UTF-16. The opposite is also true: for certain other languages, such as English, using UTF-8 saves space.

    VB6 and VBA, although store strings in memory in Unicode, implicitly switch to ANSI (using the current system code page) when doing file IO. The resulting file you get is NOT in UTF-8. It is in your current system codepage, which, as you can discover in this helpful article, looks just like UTF-8 if you're from USA.

    Try:

    Dim s As String
    s = "<?xml version='1.0' encoding='utf-16'?>"
    s = s & ChrW$(&H43F&) & ChrW$(&H440&) & ChrW$(&H43E&) & ChrW$(&H432&) & ChrW$(&H435&) & ChrW$(&H440&) & ChrW$(&H43A&) & ChrW$(&H430&)
    
    Dim b() As Byte
    b = s
    
    Open "Unicode.txt" For Binary Access Write As #1
    Put #1, , b
    Close #1
    

    And if you absolutely must have UTF-8, you can make yourself some:

    Option Explicit
    
    Private Declare Function WideCharToMultiByte Lib "kernel32.dll" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpWideCharStr As Long, ByVal cchWideChar As Long, ByRef lpMultiByteStr As Byte, ByVal cchMultiByte As Long, ByVal lpDefaultChar As String, ByRef lpUsedDefaultChar As Long) As Long
    
    Private Const CP_UTF8 As Long = 65001
    Private Const ERROR_INSUFFICIENT_BUFFER As Long = 122&
    
    
    Public Function ToUTF8(s As String) As Byte()
    
      If Len(s) = 0 Then Exit Function
    
    
      Dim ccb As Long
      ccb = WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), ByVal 0&, 0, vbNullString, ByVal 0&)
    
      If ccb = 0 Then
        Err.Raise 5, , "Internal error."
      End If
    
      Dim b() As Byte
      ReDim b(1 To ccb)
    
      If WideCharToMultiByte(CP_UTF8, 0, StrPtr(s), Len(s), b(LBound(b)), ccb, vbNullString, ByVal 0&) = 0 Then
        Err.Raise 5, , "Internal error."
      Else
        ToUTF8 = b
      End If
    
    End Function
    
    Sub Test()
      Dim s As String
      s = "<?xml version='1.0' encoding='utf-8'?>"
      s = s & ChrW$(&H43F&) & ChrW$(&H440&) & ChrW$(&H43E&) & ChrW$(&H432&) & ChrW$(&H435&) & ChrW$(&H440&) & ChrW$(&H43A&) & ChrW$(&H430&)
    
      Dim b() As Byte
      b = ToUTF8(s)
    
      Open "utf-8.txt" For Binary Access Write As #1
      Put #1, , b
      Close #1
    End Sub