
Generating docx file from HTML file using OpenXML


I'm using this method to generate a docx file:

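using System.IO;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

public static void CreateDocument(string documentFileName, string text)
{
    // Create an empty .docx package and add its main document part.
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Create(documentFileName, WordprocessingDocumentType.Document))
    {
        MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();

        // Minimal WordprocessingML template with a single paragraph and run.
        string docXml =
                    @"<?xml version=""1.0"" encoding=""UTF-8"" standalone=""yes""?>
                 <w:document xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">
                 <w:body><w:p><w:r><w:t>#REPLACE#</w:t></w:r></w:p></w:body>
                 </w:document>";

        // The text is inserted as-is (no XML escaping), so any markup in it
        // becomes part of document.xml.
        docXml = docXml.Replace("#REPLACE#", text);

        // Write the XML into the main document part as UTF-8.
        using (Stream stream = mainPart.GetStream())
        {
            byte[] buf = (new UTF8Encoding()).GetBytes(docXml);
            stream.Write(buf, 0, buf.Length);
        }
    }
}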

It works like a charm:

CreateDocument("test.docx", "Hello");

But what if I want to insert HTML content instead of "Hello"? For example:

CreateDocument("test.docx", @"<html><head></head>
                              <body>
                                    <h1>Hello</h1>
                              </body>
                        </html>");

Or something like this:

CreateDocument("test.docx", @"Hello<BR>
                                    This is a simple text<BR>
                                    Third paragraph<BR>
                                    Sign
                        ");

Both cases create an invalid structure for document.xml. Any ideas? How can I generate a docx file from HTML content?


Solution

  • You cannot just insert HTML content into document.xml. That part accepts only WordprocessingML, so you'll have to convert the HTML into WordprocessingML first.

    Another option is the altChunk element: it lets you store an HTML file as a separate part inside the DOCX package and reference that content at a specific place in the document.

    Lastly, as an alternative, the GemBox.Document library can do exactly what you want:

    public static void CreateDocument(string documentFileName, string text)
    {
        DocumentModel document = new DocumentModel();
        document.Content.LoadText(text, LoadOptions.HtmlDefault);
        document.Save(documentFileName);
    }
    

    Or you could convert the HTML content directly into a DOCX file:

    public static void Convert(string documentFileName, string htmlText)
    {
        HtmlLoadOptions options = LoadOptions.HtmlDefault;
        using (var htmlStream = new MemoryStream(options.Encoding.GetBytes(htmlText)))
            DocumentModel.Load(htmlStream, options)
                         .Save(documentFileName);
    }
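
For the altChunk approach mentioned above, here is a minimal sketch using only the Open XML SDK (the DocumentFormat.OpenXml package). The class name HtmlDocxSketch, the method name CreateDocumentFromHtml, and the relationship id "htmlChunk1" are made up for illustration and are not from the original answer. The HTML is stored as an alternative-format part and referenced from the body, so the conversion to WordprocessingML happens only when Word opens the file; other readers may not render the chunk. The sketch also assumes the HTML is plain UTF-8, so non-ASCII text may additionally need a charset declaration inside the HTML itself.

```csharp
using System.IO;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class HtmlDocxSketch
{
    // Hypothetical helper: embeds the given HTML into a new .docx via altChunk.
    public static void CreateDocumentFromHtml(string documentFileName, string html)
    {
        using (WordprocessingDocument wordDoc =
            WordprocessingDocument.Create(documentFileName, WordprocessingDocumentType.Document))
        {
            // Start from a valid, empty WordprocessingML body.
            MainDocumentPart mainPart = wordDoc.AddMainDocumentPart();
            mainPart.Document = new Document(new Body());

            // Arbitrary relationship id (an assumption for this sketch);
            // it only has to be unique within the main document part.
            const string altChunkId = "htmlChunk1";

            // Store the HTML as a separate part inside the package.
            AlternativeFormatImportPart chunkPart =
                mainPart.AddAlternativeFormatImportPart(
                    AlternativeFormatImportPartType.Html, altChunkId);

            using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(html)))
            {
                chunkPart.FeedData(htmlStream);
            }

            // Reference the HTML part from the document body.
            mainPart.Document.Body.AppendChild(new AltChunk { Id = altChunkId });
            mainPart.Document.Save();
        }
    }
}
```

It would be called the same way as the original method, for example HtmlDocxSketch.CreateDocumentFromHtml("test.docx", "<html><body><h1>Hello</h1></body></html>");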