.NET MS Interop Word not saving document in UTF8 web page


Note: A sample document I used for testing can be found at: http://ftp.3gpp.org//Specs/archive/38_series/38.413/38413-100.zip

Problem

I am trying to convert an MS Word 97-2003 document (.doc) into a UTF-8 web page with the following code:

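using System;
using Microsoft.Office.Core;                 // MsoEncoding
using Microsoft.Office.Interop.Word;         // WdSaveFormat
using Word = Microsoft.Office.Interop.Word;

var wordApp = new Word.Application();
var doc = wordApp.Documents.Open("input.doc");
Console.WriteLine(doc.TextEncoding); // msoEncodingWestern
// Request UTF-8 in every place the API exposes an encoding setting.
doc.SaveEncoding = MsoEncoding.msoEncodingUTF8;
doc.WebOptions.Encoding = MsoEncoding.msoEncodingUTF8;
doc.SaveAs2("output.htm", WdSaveFormat.wdFormatFilteredHTML, Encoding: MsoEncoding.msoEncodingUTF8);
doc.Close();
wordApp.Quit();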

The problem is that the document contains an arrow character that is rendered incorrectly in the resulting web page:

In document

[screenshot: the character as it appears in the document]

In web page

[screenshot: the character as it appears in the web page]

(Information) Manual Way

For reference, if I perform the same conversion manually in Word, as shown below, the arrow character is rendered correctly in the web page.

[screenshot: the manual conversion]


Solution

  • I solved the problem with the following:

    var from = ((char)0xF0AE).ToString();
    var to = ((char)0x2192).ToString();
    doc.Content.Find.Execute(from, ReplaceWith: to, Replace: WdReplace.wdReplaceAll);
    

    This is not a general solution: it handles only the right arrow, and a separate replacement has to be added for every other affected character (for example, a left arrow).
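
One way to make the fix above less one-off is to keep the replacements in a lookup table, so that covering another character only means adding a row. The following is a minimal sketch, not a confirmed general solution. `SymbolArrowMap`, `Map`, and `Normalize` are hypothetical names introduced here for illustration. Only the `0xF0AE` to `0x2192` row comes from the fix above; the other rows assume the character comes from the Symbol font, which Word typically exposes in the private-use range `0xF000`-`0xF0FF`, and should be checked against the actual document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; it is not part of the Word interop API.
public static class SymbolArrowMap
{
    // Private-use code point -> standard Unicode arrow.
    // Only the 0xF0AE row is taken from the fix above. The other rows are
    // assumptions based on the Symbol font layout and need to be verified.
    public static readonly IReadOnlyDictionary<char, char> Map =
        new Dictionary<char, char>
        {
            { '\uF0AB', '\u2194' }, // left-right arrow (assumed)
            { '\uF0AC', '\u2190' }, // left arrow (assumed)
            { '\uF0AD', '\u2191' }, // up arrow (assumed)
            { '\uF0AE', '\u2192' }, // right arrow (from the fix above)
            { '\uF0AF', '\u2193' }, // down arrow (assumed)
        };

    // Replaces every mapped character in a plain string.
    // Characters that are not in the table pass through unchanged.
    public static string Normalize(string text)
    {
        var chars = text.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            if (Map.TryGetValue(chars[i], out var replacement))
            {
                chars[i] = replacement;
            }
        }
        return new string(chars);
    }
}
```

In the Word program itself, the same table could drive the replacement: loop over `SymbolArrowMap.Map` and, for each pair, call `doc.Content.Find.Execute(pair.Key.ToString(), ReplaceWith: pair.Value.ToString(), Replace: WdReplace.wdReplaceAll)`, presumably before the document is saved. `Normalize` is only there so the table can be checked on plain strings without Word installed.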