c# openxml openxml-sdk

Add HTML String to OpenXML (*.docx) Document


I am trying to use Microsoft's OpenXML 2.5 library to create an OpenXML document. Everything works great until I try to insert an HTML string into my document. I have scoured the web, and here is what I have come up with so far (snipped to just the portion I am having trouble with):

Paragraph paragraph = new Paragraph();
Run run = new Run();

string altChunkId = "id1";
AlternativeFormatImportPart chunk =
       document.MainDocumentPart.AddAlternativeFormatImportPart(
           AlternativeFormatImportPartType.Html, altChunkId);
chunk.FeedData(new MemoryStream(Encoding.UTF8.GetBytes(ioi.Text)));
AltChunk altChunk = new AltChunk { Id = altChunkId };

run.AppendChild(new Break());

paragraph.AppendChild(run);
body.AppendChild(paragraph);

Obviously, I haven't actually added the altChunk in this example, but I have tried appending it everywhere: to the run, the paragraph, the body, etc. In every case, I am unable to open the docx file in Word 2010.

This is making me a little nutty because it seems like it should be straightforward (I will admit that I'm not fully understanding the AltChunk "thing"). Would appreciate any help.

Side Note: One thing I did find that was interesting, and I don't know if it's actually a problem or not, is this response which says AltChunk corrupts the file when working from a MemoryStream. Can anybody confirm that this is/isn't true?


Solution

  • I can reproduce the error "... there is a problem with the content" by using an incomplete HTML document as the content of the alternative format import part. For example, if you use the HTML snippet <h1>HELLO</h1> on its own, MS Word is unable to open the document.

    The code below shows how to add an AlternativeFormatImportPart to a Word document. (I've tested the code with MS Word 2013.)

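    using System.IO;
    using System.Text;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Wordprocessing;

    using (WordprocessingDocument doc = WordprocessingDocument.Open(@"test.docx", true))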
    {
      string altChunkId = "myId";
      MainDocumentPart mainDocPart = doc.MainDocumentPart;
    
      var run = new Run(new Text("test"));
      var p = new Paragraph(new ParagraphProperties(
           new Justification() { Val = JustificationValues.Center }),
                         run);
    
      var body = mainDocPart.Document.Body;
      body.Append(p);        
    
      MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes("<html><head></head><body><h1>HELLO</h1></body></html>"));
    
      // Uncomment the following line to create an invalid word document.
      // MemoryStream ms = new MemoryStream(Encoding.UTF8.GetBytes("<h1>HELLO</h1>"));
    
      // Create alternative format import part.
      AlternativeFormatImportPart formatImportPart =
         mainDocPart.AddAlternativeFormatImportPart(
            AlternativeFormatImportPartType.Html, altChunkId);
      //ms.Seek(0, SeekOrigin.Begin);
    
      // Feed HTML data into format import part (chunk).
      formatImportPart.FeedData(ms);
      AltChunk altChunk = new AltChunk();
      altChunk.Id = altChunkId;
    
      mainDocPart.Document.Body.Append(altChunk);
    }
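
    Because Word rejects a bare fragment, one option is to wrap fragments in a minimal HTML document before feeding them to the import part. The following is only a sketch: HtmlChunkHelper and EnsureCompleteHtml are hypothetical names, not part of the Open XML SDK, and the check for an existing <html tag is deliberately naive.

```csharp
using System;

public static class HtmlChunkHelper
{
    // Hypothetical helper (not part of the Open XML SDK): wraps a bare HTML
    // fragment in a minimal document so that Word receives complete HTML.
    public static string EnsureCompleteHtml(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        // Naive check (an assumption of this sketch): any input that already
        // contains an <html tag is treated as a complete document.
        if (html.IndexOf("<html", StringComparison.OrdinalIgnoreCase) >= 0)
            return html;

        return "<html><head></head><body>" + html + "</body></html>";
    }
}
```

    With this in place, the MemoryStream in the code above could be built from HtmlChunkHelper.EnsureCompleteHtml(someHtml) instead of a hard-coded string, where someHtml stands for whatever HTML you need to import.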
    

    According to the Office OpenXML specification, the valid parent elements for the w:altChunk element are body, comment, docPartBody, endnote, footnote, ftr, hdr and tc. That is why I've added the w:altChunk to the body element rather than to a paragraph or run.

    For more information on the w:altChunk element see this MSDN link.

    EDIT

    As pointed out by @user2945722, to make sure that the OpenXml library correctly interprets the byte array as UTF-8, you should add the UTF-8 preamble. This can be done this way:

    // Concat and ToArray require "using System.Linq;".
    MemoryStream ms = new MemoryStream(new UTF8Encoding(true).GetPreamble().Concat(Encoding.UTF8.GetBytes(htmlEncodedString)).ToArray());
    

    This will prevent your é's from being rendered as Ã©'s, your ä's as Ã¤'s, etc.
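
    The preamble step can also be packaged as a small helper. This is a sketch under assumptions: HtmlStreamHelper and ToUtf8StreamWithPreamble are hypothetical names, and the code uses only System.Text, System.Linq and System.IO from the base class library.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class HtmlStreamHelper
{
    // Hypothetical helper: returns a stream that starts with the UTF-8
    // preamble (byte order mark), followed by the UTF-8 bytes of the HTML.
    public static MemoryStream ToUtf8StreamWithPreamble(string html)
    {
        // UTF8Encoding(true) emits the three-byte preamble EF BB BF.
        byte[] preamble = new UTF8Encoding(true).GetPreamble();
        byte[] content = Encoding.UTF8.GetBytes(html);

        // The new stream is positioned at 0, ready to be read.
        return new MemoryStream(preamble.Concat(content).ToArray());
    }
}
```

    The returned stream can then be passed to FeedData in place of the plain MemoryStream used earlier.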