Converting HTML strings in an Excel file to a formatted Word file with .NET


The input is a set of Excel files whose cells may contain basic HTML formatting such as <b>, <br>, <h2>.

I want to read the strings and insert them as formatted text into Word documents, i.e. <b>Foo</b> should appear as a bold string in Word.

I don't know which tags are used, so I need a generic solution; a find/replace approach does not work for me.

I found a solution from January 2011 that uses the WebBrowser component: the HTML is converted to RTF, and the RTF is inserted into Word. I was wondering if there is a better solution today.

Using a commercial component is fine for me.

Update

I came across Matthew Manela's MarkupConverter class, which converts HTML to RTF. I then use the clipboard to insert the snippet into the Word file:

// rtf contains the HTML string converted to RTF by MarkupConverter
Clipboard.SetText(rtf, TextDataFormat.Rtf);
// objTable is a table in my Word file
objTable.Cell(1, 1).Range.Paste();

This works, but will copy/pasting up to a few thousand strings using the clipboard break anything?


Solution

  • You will need the OpenXML SDK to work with OpenXML documents. It can be tricky to get into, but it is very powerful, and far more stable and reliable than Office Automation or Interop.

    The following opens a document, creates an AltChunk part, adds the HTML to it, and embeds it into the document. For a broader overview of AltChunk, see Eric White's blog.

    using System.IO;
    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    // Open the document for editing (second argument: isEditable)
    using (var wordDoc = WordprocessingDocument.Open("DocumentName.docx", true))
    {
        var altChunkId = "AltChunkId1";
        var mainPart = wordDoc.MainDocumentPart;

        // Create the part that holds the raw HTML
        var chunk = mainPart.AddAlternativeFormatImportPart(AlternativeFormatImportPartType.Html, altChunkId);
        var html = "<html><body>...</body></html>";
        using (var textStream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
        {
            chunk.FeedData(textStream);
        }

        // Reference the part from the body; Word converts the HTML when it opens the file
        var altChunk = new AltChunk();
        altChunk.Id = altChunkId;
        mainPart.Document.Body.InsertAt(altChunk, 0);
        mainPart.Document.Save();
    }
    

    For your case, you will want to find (or build) the target table and insert the AltChunk there instead of at the first position in the body. Note that the HTML you insert into the Word document must be a full HTML document with an <html> tag. I'm not sure whether <body> is required, but it doesn't hurt. If you only have HTML-formatted text, wrap it in these tags before inserting it.
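
    Wrapping a cell's HTML fragment could look like the sketch below. HtmlWrapper and WrapFragment are hypothetical names, not part of any library. The <meta charset> line is an assumption on my part, added so that the UTF-8 bytes fed to the chunk are read as UTF-8; drop it if it causes trouble.

    ```csharp
    using System;

    static class HtmlWrapper
    {
        // Hypothetical helper: turns an HTML fragment such as "<b>Foo</b>"
        // into a full HTML document suitable for an AltChunk part.
        public static string WrapFragment(string fragment)
        {
            if (fragment == null) throw new ArgumentNullException(nameof(fragment));

            // Input that already contains an <html> tag is returned unchanged.
            if (fragment.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0)
                return fragment;

            // Assumption: declaring the charset keeps non-ASCII text intact.
            return "<html><head><meta charset=\"utf-8\"></head><body>"
                 + fragment
                 + "</body></html>";
        }
    }
    ```

    The result can be passed to Encoding.UTF8.GetBytes in place of the hard-coded html string above.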

    It seems that you will need Office Automation/Interop to get the table heights. See this answer, which says that the OpenXML SDK does not update the heights; only Word does.
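
    Coming back to the table case: a rough sketch of placing the AltChunk inside a specific cell rather than at the top of the body. HtmlCellWriter and InsertHtmlIntoCell are hypothetical names. The sketch assumes that a table cell has to end with a paragraph for Word to accept the file, so the chunk goes in before the cell's last paragraph. I have not run it against a large document.

    ```csharp
    using System.IO;
    using System.Linq;
    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    static class HtmlCellWriter
    {
        // Hypothetical helper: embeds a full HTML document into one table cell.
        // altChunkId must be unique within the main document part.
        public static void InsertHtmlIntoCell(
            MainDocumentPart mainPart, TableCell cell, string html, string altChunkId)
        {
            // Store the HTML in its own part, as in the example above.
            var chunk = mainPart.AddAlternativeFormatImportPart(
                AlternativeFormatImportPartType.Html, altChunkId);
            using (var stream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
            {
                chunk.FeedData(stream);
            }

            var altChunk = new AltChunk { Id = altChunkId };

            // Assumption: the cell must still end with a paragraph afterwards.
            var lastParagraph = cell.Elements<Paragraph>().LastOrDefault();
            if (lastParagraph != null)
            {
                cell.InsertBefore(altChunk, lastParagraph);
            }
            else
            {
                cell.AppendChild(altChunk);
                cell.AppendChild(new Paragraph());
            }
        }
    }
    ```

    With a few thousand strings, each call needs its own id (for example "AltChunkId" plus a running counter), and mainPart.Document.Save() only has to be called once at the end. This route does not touch the clipboard at all.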