c# openxml openxml-sdk

Convert Docx to html using OpenXml power tools without formatting


I'm using OpenXml Power Tools in my project to convert a document (docx) into HTML. Using the sample code provided with the library, it produces an elegant duplicate in HTML form. (GitHub link: https://github.com/OfficeDev/Open-Xml-PowerTools/blob/vNext/OpenXmlPowerToolsExamples/HtmlConverter01/HtmlConverter01.cs )

However, looking at the HTML markup, the output has embedded (inline) styling.

Is there any way of turning this off and using plain and simple <h1> and <p> tags?

I would like to get rid of this embedded styling, as the formatting would be taken care of by Bootstrap.

The embedded styling is as follows:

 <p dir="ltr" style="font-family: Calibri;font-size: 11pt;line-height: 115.0%;margin-bottom: 0;margin-left: 0;margin-right: 0;margin-top: 0;">
 <span xml:space="preserve" style="font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"> </span>
 </p>

This, as you can see, is fine if you want a direct copy, but not if you want to control the style yourself.

In the C# code I have already made the following adjustments:

  • AdditionalCss is commented out
  • FabricateCssClasses is false
  • CssClassPrefix is commented out

Many thanks.


Solution

  • You can also use XmlReader and XmlWriter to obtain bare-bones HTML. This could however be a little overkill, as only the tags themselves and their text content will be kept; every attribute is dropped.

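    using System.IO;
    using System.Text;
    using System.Xml;

    public static class HtmlHelper
    {
        /// <summary>
        /// Keep only the opening and closing tags and the text content of the html.
        /// All attributes (including inline styles) are dropped.
        /// </summary>
        public static string CleanUp(string html)
        {
            var output = new StringBuilder();
            using (var reader = XmlReader.Create(new StringReader(html)))
            {
                var settings = new XmlWriterSettings() { Indent = true, OmitXmlDeclaration = true };
                using (var writer = XmlWriter.Create(output, settings))
                {
                    while (reader.Read())
                    {
                        switch (reader.NodeType)
                        {
                            case XmlNodeType.Element:
                                // Write the tag name only; attributes are never copied.
                                writer.WriteStartElement(reader.Name);
                                // Self-closing elements (e.g. <br />) produce no EndElement node,
                                // so close them here to keep the nesting correct.
                                if (reader.IsEmptyElement)
                                    writer.WriteEndElement();
                                break;
                            case XmlNodeType.Text:
                                // Whitespace-only content is reported as Whitespace or
                                // SignificantWhitespace, not Text, so it is dropped.
                                writer.WriteString(reader.Value);
                                break;
                            case XmlNodeType.EndElement:
                                writer.WriteFullEndElement();
                                break;
                        }
                    }
                }
            }

            return output.ToString();
        }
    }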
    

    Resulting output:

    <p>
      <span></span>
    </p>
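
    If dropping every attribute is more than you need, a gentler variant is to remove only the style attributes and leave everything else (dir, xml:space, text, whitespace) alone. The following is a minimal sketch using LINQ to XML, assuming the converter output is well-formed XHTML that can be parsed as XML. StyleStripper and RemoveInlineStyles are hypothetical names chosen for illustration; they are not part of OpenXml Power Tools.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class StyleStripper
{
    // Hypothetical helper (not part of OpenXml Power Tools): removes every
    // "style" attribute but keeps all other attributes and all content.
    public static string RemoveInlineStyles(string xhtml)
    {
        // PreserveWhitespace keeps whitespace-only text nodes intact.
        var root = XElement.Parse(xhtml, LoadOptions.PreserveWhitespace);

        // Collect the style attributes of the root and all descendants,
        // then detach them from their parent elements.
        root.DescendantsAndSelf()
            .Attributes("style")
            .Remove();

        // DisableFormatting avoids re-indenting, so text content is not altered.
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

    Calling StyleStripper.RemoveInlineStyles on the paragraph from the question should leave the <p> and <span> elements in place without their style attributes, so Bootstrap (or any other stylesheet) decides how they look. If the converted document is already available as an XElement, the same two LINQ calls can be applied to it directly instead of parsing a string.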