Note: A sample document I used for testing can be found at: http://ftp.3gpp.org//Specs/archive/38_series/38.413/38413-100.zip
I am trying to convert an MS Word 97-2003 document (.doc) into a UTF-8 web page with the following code:
var wordApp = new Word.Application();
var doc = wordApp.Documents.Open("input.doc");
Console.WriteLine(doc.TextEncoding); // msoEncodingWestern
doc.SaveEncoding = MsoEncoding.msoEncodingUTF8;
doc.WebOptions.Encoding = MsoEncoding.msoEncodingUTF8;
doc.SaveAs2("output.htm", WdSaveFormat.wdFormatFilteredHTML, Encoding: MsoEncoding.msoEncodingUTF8);
doc.Close();
wordApp.Quit();
The problem is that the document contains a character (an arrow) that is rendered incorrectly in the resulting web page:
[Screenshot: the character in the document]
[Screenshot: the character in the web page]
For information, if I perform the same conversion manually in Word, the arrow character is rendered correctly in the web page.
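I solved the problem by replacing the private-use character U+F0AE, which is what the document contains for the arrow, with the Unicode right arrow U+2192 before saving: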
var from = ((char)0xF0AE).ToString();
var to = ((char)0x2192).ToString();
doc.Content.Find.Execute(from, ReplaceWith: to, Replace: WdReplace.wdReplaceAll);
This is not a general solution: it handles only the right arrow, and a separate replacement has to be added for every other affected character, such as a left arrow.
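One way to make this less ad hoc is to keep the replacements in a table. The following is only a sketch: `SymbolFontMap`, `Arrows` and `ToUnicode` are hypothetical names, and only the U+F0AE to U+2192 pair comes from the fix above. The other entries assume the character belongs to the Symbol font and that Word exposes every Symbol character at 0xF000 plus its Symbol code, which I have not verified against the sample document.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: maps private-use characters to regular Unicode arrows.
public static class SymbolFontMap
{
    // Key: private-use code point (assumed to be 0xF000 + Symbol font code).
    // Value: the corresponding Unicode arrow.
    public static readonly IReadOnlyDictionary<char, char> Arrows =
        new Dictionary<char, char>
        {
            { '\uF0AB', '\u2194' }, // left right arrow (assumption)
            { '\uF0AC', '\u2190' }, // leftwards arrow (assumption)
            { '\uF0AD', '\u2191' }, // upwards arrow (assumption)
            { '\uF0AE', '\u2192' }, // rightwards arrow (the case fixed above)
            { '\uF0AF', '\u2193' }, // downwards arrow (assumption)
        };

    // Returns a copy of the text with every mapped character replaced;
    // all other characters are copied unchanged.
    public static string ToUnicode(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (var c in text)
        {
            sb.Append(Arrows.TryGetValue(c, out var mapped) ? mapped : c);
        }
        return sb.ToString();
    }
}
```

With such a table, the `Find.Execute` call shown above could be run once per entry of `SymbolFontMap.Arrows` before saving, and `ToUnicode` can be used to check the mapping on plain strings without starting Word. If the character comes from a different symbol font, the table would need different entries.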