Search code examples
vbams-wordasciiextended-ascii

How to process multibyte symbols in MS Word VBA


I'm trying to extract the text from a large number of old Word files and put the text in a database. I'm doing this by parsing the text into sections, creating a text file for each section, and then doing a bulk insert.

Whoever wrote these documents originally used some strange characters. In Word the look similar to the normal extended ASCII characters. But when I start looking at the decimal or hex codes, they are not any known character.

strange characters and their ASCII equivalents

On the left is the character originally in the document and on the right is the appropriate ASCII character entered from the keyboard.

When I copy and paste them into vim, it looks like this:

vim's interpretation of the characters

Using vim to look at the decimal and hex codes, the file looks like this:

Original      True ASCII
Dec    Hex    Dec    Hex
61617  f0b1   177    00b1
61666  f0e2   174    00ae
 8220  201c    34    22
 8221  201d    34    22

I'm using some of the lines from code found here:

NextChar = ActiveDocument.Characters(idx)
Dim nBytes As Long
Dim abBuffer() As Byte

nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(NextChar), -1, vbNull, 0&, 0&, 0&)
ReDim abBuffer(4)
nBytes = WideCharToMultiByte(CP_UTF8, 0&, ByVal StrPtr(NextChar), -1, ByVal VarPtr(abBuffer(0)), nBytes - 1, 0&, 0&)

Using this method, VBA returns the following decimal codes (some are multiple bytes, expressed using commas):

Original        True ASCII
Dec             Dec
40              194, 177
40              194, 174
226, 128, 156   34
226, 128, 157   34

I've also tried Asc and AscW. These work fine on the quote marks somehow, returning only the final byte. But since ActiveDocument.Characters returns a parenthesis for the other two, it just processes it like a parenthesis.

I have several questions regarding these various outputs as well as how to properly handle these characters.

  • Why does ActiveDocument.Characters return a parenthesis when reading the plus/minus and the registered trademark symbol?
  • Why is 194 placed in front when using those characters?
  • Most ASCII tables I've seen follow the encoding seen here. But those indicate that 177 and 174 should be dots and double arrows, respectively. Which does not match up to Word or vim. But this table seems to agree with Word and vim. Are there multiple ASCII encodings? I thought it was a standard.
  • What is the proper way to read in these special, multiple byte characters so that I can identify them and replace them with their ASCII equivalents?

Edit:

Just learned about using AltX in Word to change each character into it's unicode number. This works fine on the original quotation marks but it does nothing when I try it on the original plus/minus and trademark symbol. Not sure where these characters come from.


Edit 2:

I tried saving into a text file. The plus/minus and trademark symbol will not properly convert in Western European (Windows) encoding. UTF-8 is better but also has problems. Unicode will convert everything but it converts the problem characters to whatever the preceding character is.

Western European (Windows)

Unicode


Edit 3:

Link to test file


Edit 4:

I used the Open XML Productivity Tool and looked at the XML directly and possible code to create these weird symbols. This is what I've found:

<w:r w:rsidRPr="00EE7521">
 <w:rPr>
   <w:sz w:val="16" />
 </w:rPr>
 <w:sym w:font="Symbol" w:char="F0B1" />

And:

RunProperties runProperties1 = new RunProperties();
FontSize fontSize2 = new FontSize(){ Val = "16" };

runProperties1.Append(fontSize2);
SymbolChar symbolChar1 = new SymbolChar(){ Font = "Symbol", Char = "F0B1" };

Is there a way to detect and properly decode SymbolChar typed characters in VBA? Or any other language at this point.


Solution

  • I just tested this simple macro and it successfully replaced instances of your first two symbols (f0b1, f0e2) with their ASCII equivalents on my test document. It just clears the first byte when detecting that the character is from the PUA (Private Use Area).

    Private Sub Strip_PUA()
        For idx = 1 To ActiveDocument.Characters.Count
            Dim bArr() As Byte
            bArr = ActiveDocument.Characters(idx)
    
            If bArr(1) >= &HE0 And bArr(1) <= &HF8 Then
                bArr(1) = 0
                ActiveDocument.Characters(idx) = bArr
            End If
        Next
    End Sub
    

    You might have to tweak the bArr to something other than 1 based on endianness of the system and bytecount of the characters. This also relies on the fact that the characters you've encountered happen to have the correct ASCII byte already. That might not always be the case - you'll have to investigate to be sure.


    EDIT: Reproduced from this Google Groups discussion.

    Sub SymbolsUnprotect()
      Dim SelFont, SelCharNum
    
      Selection.Collapse (wdCollapseStart)
      Selection.Find.ClearFormatting
      With Selection.Find
        .Text = "[" & ChrW(61472) & "-" & ChrW(61695) & "]"
        .Replacement.Text = ""
        .Forward = True
        .Wrap = wdFindContinue
        .Format = False
        .MatchCase = False
        .MatchWholeWord = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = True
      End With
      While Selection.Find.Execute
        With Dialogs(wdDialogInsertSymbol)
          SelFont = .Font
          SelCharNum = .CharNum
        End With
    
        Selection.Font.Name = SelFont
        Selection.TypeText Text:=ChrW(SelCharNum)
    
        ' replace the last 2 lines with the following to
        ' protect symbols from decorative fonts:
        ' Selection.InsertSymbol _
        '   Font:=SelFont, _
        '   CharacterNumber:=SelCharNum, _
        '   Unicode:=True
    
      Wend
    End Sub
    

    This will "unprotect" symbols, which will show up as a "(" (decimal 40) character when protected - which is the default for symbols inserted through Word's Insert > Symbol Dialog. It will allow you to properly read the bytes of these characters as f0**, but won't be able to tell you exactly what those bytes correspond to for fonts like Symbol, which define their own mappings in the PUA. Look up these mappings for conversion into Unicode (linked to the specific character block containing the mappings of ± (Symbol 177) to (Unicode 177) and of ® (Symbol 226) to (Unicode 174)).