Tags: c#, asp.net, openxml, openxml-sdk

How to write rich text to a Word document generated from an HTML file in C#


I am trying to generate a Word document from a saved HTML file using an Open XML library. If the HTML file does not contain an image, I can simply use the code below to write the text content to the Word document.

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load(fileName); // fileName is the path to the saved .htm file
HtmlNode hcollection = doc.DocumentNode.SelectSingleNode("//body");
string Detail = hcollection.InnerText; // text only: tags and images are dropped

But if the HTML file contains an embedded image, I am struggling to include that image in the Word document.

Using hcollection.InnerText writes only the text and excludes the image.

When I use

HtmlNode hcollection = doc.DocumentNode.SelectSingleNode("//body");
Detail = hcollection.InnerHtml;

all the HTML tags get written to the Word document, along with the image path inside the tag:

<table border='0' width='100%' cellpadding='0' cellspacing='0' align='center'>
<tr><td valign='top' align="left">
<div style='width:100%'><div id="div_img">
<div>
 <img src="http://www.myweb.com/web/img/2013/07/18/img_1.jpg">
 <span>Sample Text</span></div></div><br><br>Sample Text Content here<br><br>                         </div></td></tr></table>

How can I remove the HTML tags and, instead of the path being shown like

<img src="http://www.myweb.com/web/img/2013/07/18/img_1.jpg">

have the corresponding picture loaded into the document?

Please help.


Solution

  • You'll need to parse the HTML and translate it into the equivalent OpenXML elements, either yourself or with a library.

    I've used the open-source HtmlToOpenXml library, and it works well enough. It should handle images (inline, local or remote) and correctly insert them into the OpenXML document. I recently submitted a patch which was accepted, so the project is still somewhat active.

    There are some limitations with the library though:

    JavaScript (<script>), CSS (<style>), <meta> and other unsupported tags do not generate an error; they are simply ignored.

    It does handle inline style information, but it entirely ignores other CSS, and CSS support was something I needed. I ended up integrating some simple parsing of a single <style> element from another open-source project (jsonfx, using MIT license).

    Note: handling multiple <style> elements, downloading CSS files, and sorting out which style rules take precedence are all problems which I did not address.
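
To make the suggestion above concrete, here is a minimal sketch of how the conversion might look. It assumes the 2.x API of the HtmlToOpenXml NuGet package (HtmlConverter.Parse returning a list of elements) together with the DocumentFormat.OpenXml SDK; newer releases of either library may expose different method names. The HtmlToDocx class and its FromHtml method are hypothetical names used only for illustration.

```csharp
using System.IO;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using HtmlToOpenXml;

public static class HtmlToDocx
{
    // Hypothetical helper: converts an HTML string into the bytes of a .docx file.
    public static byte[] FromHtml(string html)
    {
        using (var stream = new MemoryStream())
        {
            using (WordprocessingDocument package = WordprocessingDocument.Create(
                       stream, WordprocessingDocumentType.Document))
            {
                // A new package has no main part yet, so create one with an empty body.
                MainDocumentPart mainPart = package.AddMainDocumentPart();
                mainPart.Document = new Document(new Body());

                // The converter turns HTML markup (including <img> tags) into
                // OpenXML elements; image parts are added to mainPart as needed.
                var converter = new HtmlConverter(mainPart);
                foreach (var element in converter.Parse(html))
                {
                    mainPart.Document.Body.Append(element);
                }

                mainPart.Document.Save();
            }

            // The package is disposed at this point, so the stream holds the full file.
            return stream.ToArray();
        }
    }
}
```

With the question's code, this could be called as File.WriteAllBytes("output.docx", HtmlToDocx.FromHtml(hcollection.InnerHtml)), passing the markup rather than InnerText so the converter can see the <img> tag. Remote images such as the one in the sample are downloaded during conversion, so the machine running the code needs network access to that URL.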