I'm trying to convert XML text that contains inline formatting tags into a DOCX file. I'm not generating a new document, but inserting the text into a template document. A typical source element looks like this:
<p id="_fab91699-6d85-4ce5-b0b5-a17197520a7f">This document is amongst a series of International Standards dealing with the conversion of systems of writing produced by Technical Committee ISO/TC 46, <em>Information and documentation</em>, WG 3 <em>Conversion of written languages</em>.</p>
I collected the text fragments in an array, then tried to process them with code like this:
    foreach (var bkmkStart in wordDoc.MainDocumentPart.RootElement.Descendants<BookmarkStart>())
    {
        if (bkmkStart.Name == "ForewordText")
        {
            forewordbkmkParent = bkmkStart.Parent;
            for (var y = 0; y <= ForewordArray.Length / (double)2 - 1; y++)
            {
                if (ForewordArray[0, y] == "Normal")
                {
                    if (y < ForewordArray.Length / (double)2 - 1)
                    {
                        if (ForewordArray[0, y + 1] == "Normal")
                        {
                            forewordbkmkParent.InsertBeforeSelf(new Paragraph(new Run(new Text(ForewordArray[1, y]))));
                        }
                        else
                        {
                            fPara = forewordbkmkParent.InsertBeforeSelf(new Paragraph(new Run(new Text(ForewordArray[1, y]))));
                        }
                    }
                    else
                    {
                        fPara.InsertAfter(new Run(new Text(ForewordArray[1, y])), fPara.GetFirstChild<Run>());
                    }
                }
                else
                {
                    NewRun = forewordbkmkParent.InsertBeforeSelf(new Run());
                    NewRunProps = new RunProperties();
                    NewRunProps.AppendChild<Italic>(new Italic());
                    NewRun.AppendChild<RunProperties>(NewRunProps);
                    NewRun.AppendChild(new Text(ForewordArray[1, y]));
                }
            }
        }
    }
but I end up with malformed XML because the runs are inserted after the paragraphs instead of inside them:
    <w:p>
      <w:r>
        <w:t>This document is amongst a series of International Standards dealing with the conversion of systems of writing produced by Technical Committee ISO/TC 46, </w:t>
      </w:r>
    </w:p>
    <w:r>
      <w:rPr>
        <w:i />
      </w:rPr>
      <w:t>Information and documentation</w:t>
    </w:r>
    <w:p>
      <w:r>
        <w:t>, WG 3 </w:t>
      </w:r>
      <w:r>
        <w:t>.</w:t>
      </w:r>
    </w:p>
    <w:r>
      <w:rPr>
        <w:i />
      </w:rPr>
      <w:t>Conversion of written languages</w:t>
    </w:r>
Doing this the right way, using the SDK, would be best. As an alternative, I was able to create a string with all the correct XML and text using regexes, but I can't find a WordprocessingDocument method to turn that into an XML fragment that I can insert.
The solution for this kind of problem is to perform a pure functional transformation, as shown in the following code example.
The code example uses the sample XML element `<p>` given in the question (see the `Xml` constant below). It transforms it into a corresponding Open XML `w:p` element, i.e., a `Paragraph` instance in terms of the strongly-typed classes provided by the Open XML SDK. The expected outer XML of that `w:p` or `Paragraph` is defined by the `OuterXml` constant.
    using System;
    using System.Linq;
    using System.Xml.Linq;
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Wordprocessing;
    using Xunit;

    namespace CodeSnippets.Tests.OpenXml.Wordprocessing
    {
        public class XmlTransformationTests
        {
            // The source XML from the question.
            private const string Xml =
                @"<p id=""_fab91699-6d85-4ce5-b0b5-a17197520a7f"">" +
                @"This document is amongst a series of International Standards dealing with the conversion of systems of writing produced by Technical Committee ISO/TC 46, " +
                @"<em>Information and documentation</em>" +
                @", WG 3 " +
                @"<em>Conversion of written languages</em>" +
                @"." +
                @"</p>";

            // The expected Open XML markup: a single w:p containing all runs.
            private const string OuterXml =
                @"<w:p xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">" +
                @"<w:r><w:t xml:space=""preserve"">This document is amongst a series of International Standards dealing with the conversion of systems of writing produced by Technical Committee ISO/TC 46, </w:t></w:r>" +
                @"<w:r><w:rPr><w:i /></w:rPr><w:t>Information and documentation</w:t></w:r>" +
                @"<w:r><w:t xml:space=""preserve"">, WG 3 </w:t></w:r>" +
                @"<w:r><w:rPr><w:i /></w:rPr><w:t>Conversion of written languages</w:t></w:r>" +
                @"<w:r><w:t>.</w:t></w:r>" +
                @"</w:p>";

            [Fact]
            public void CanTransformXmlToOpenXml()
            {
                // Arrange, creating an XElement based on the given XML.
                var xmlParagraph = XElement.Parse(Xml);

                // Act, transforming the XML into Open XML.
                var paragraph = (Paragraph) TransformElementToOpenXml(xmlParagraph);

                // Assert, demonstrating that we have indeed created an Open XML Paragraph instance.
                Assert.Equal(OuterXml, paragraph.OuterXml);
            }

            // Maps an XML element to its Open XML counterpart: <p> becomes a Paragraph,
            // <em> and <b> become formatted Runs.
            private static OpenXmlElement TransformElementToOpenXml(XElement element)
            {
                return element.Name.LocalName switch
                {
                    "p" => new Paragraph(element.Nodes().Select(TransformNodeToOpenXml)),
                    "em" => new Run(new RunProperties(new Italic()), CreateText(element.Value)),
                    "b" => new Run(new RunProperties(new Bold()), CreateText(element.Value)),
                    _ => throw new ArgumentOutOfRangeException()
                };
            }

            // Maps a child node of <p>: elements are transformed recursively,
            // plain text nodes become unformatted Runs.
            private static OpenXmlElement TransformNodeToOpenXml(XNode node)
            {
                return node switch
                {
                    XElement element => TransformElementToOpenXml(element),
                    XText text => new Run(CreateText(text.Value)),
                    _ => throw new ArgumentOutOfRangeException()
                };
            }

            // Creates a w:t element, setting xml:space="preserve" only when the text
            // starts or ends with whitespace, so that Word does not trim it.
            private static Text CreateText(string text)
            {
                return new Text(text)
                {
                    Space = text.Length > 0 && (char.IsWhiteSpace(text[0]) || char.IsWhiteSpace(text[^1]))
                        ? new EnumValue<SpaceProcessingModeValues>(SpaceProcessingModeValues.Preserve)
                        : null
                };
            }
        }
    }
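To tie the transformation back to the template scenario in the question, the resulting `Paragraph` still has to be placed in the document. The following is a minimal sketch, not a definitive implementation: the class and method names (`BookmarkInsertionSketch`, `InsertBeforeBookmark`) are made up for illustration, and it assumes the bookmark sits inside a paragraph, which may not hold for every template.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class BookmarkInsertionSketch
{
    // Hypothetical helper: inserts a complete paragraph before the paragraph
    // that contains the named bookmark.
    public static void InsertBeforeBookmark(
        WordprocessingDocument wordDoc, string bookmarkName, Paragraph newParagraph)
    {
        // Find the first bookmark start with the given name in the main document part.
        var bookmarkStart = wordDoc.MainDocumentPart?.RootElement
            ?.Descendants<BookmarkStart>()
            .FirstOrDefault(b => b.Name?.Value == bookmarkName);

        // Assumption: the bookmark is located inside a w:p element.
        var anchor = bookmarkStart?.Ancestors<Paragraph>().FirstOrDefault()
            ?? throw new InvalidOperationException(
                $"No paragraph containing bookmark '{bookmarkName}' was found.");

        // The new paragraph already contains all of its runs, so a single
        // insertion keeps the markup well-formed.
        anchor.InsertBeforeSelf(newParagraph);
    }
}
```

Because the paragraph is built completely before it is inserted, no run can end up as a sibling of a paragraph, which is what went wrong in the question's loop.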
The above sample deals with `<p>` (paragraph), `<em>` (emphasis / italic), and `<b>` (bold) elements. Adding further formatting elements (e.g., underlining) is easy.
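As an illustration of adding another formatting element, here is a small sketch that also maps an underline element. The tag name `<u>` is an assumption (the question does not say which tag the source XML uses for underlining), and `UnderlineSketch`, `TransformInline`, and `CreateRun` are illustrative names, not part of the sample above.

```csharp
using System;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class UnderlineSketch
{
    // Maps a non-nested inline formatting element to a formatted run.
    public static Run TransformInline(XElement element)
    {
        return element.Name.LocalName switch
        {
            "em" => CreateRun(new Italic(), element.Value),
            "b" => CreateRun(new Bold(), element.Value),
            // Assumed tag name; adjust it to whatever the source XML actually uses.
            "u" => CreateRun(new Underline { Val = UnderlineValues.Single }, element.Value),
            _ => throw new ArgumentOutOfRangeException(
                nameof(element), element.Name.LocalName, "Unsupported element.")
        };
    }

    private static Run CreateRun(OpenXmlElement formatting, string text)
    {
        // For brevity, whitespace is always preserved here, rather than only
        // when the text starts or ends with whitespace.
        return new Run(
            new RunProperties(formatting),
            new Text(text) { Space = SpaceProcessingModeValues.Preserve });
    }
}
```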
Note that the sample code makes the simplifying assumption that `<em>`, `<b>`, and potentially further formatting elements are not nested. Adding the capability to nest those elements would make the sample code a little more complicated (but it's obviously possible).
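One possible way to support nesting is sketched below. This is an assumption about how it could be done, not the approach of the sample above: the tree is walked recursively, the formatting flags are accumulated on the way down, and one run is emitted per text node. The names `NestedFormattingSketch`, `ToParagraph`, `ToRuns`, and `CreateRun` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

public static class NestedFormattingSketch
{
    // Transforms a <p> element whose <em> and <b> children may be nested.
    public static Paragraph ToParagraph(XElement p)
    {
        return new Paragraph(p.Nodes().SelectMany(n => ToRuns(n, false, false)));
    }

    // Flattens a node into runs, carrying the formatting inherited from its ancestors.
    private static IEnumerable<OpenXmlElement> ToRuns(XNode node, bool italic, bool bold)
    {
        switch (node)
        {
            case XText text:
                return new OpenXmlElement[] { CreateRun(text.Value, italic, bold) };
            case XElement e when e.Name.LocalName == "em":
                return e.Nodes().SelectMany(n => ToRuns(n, true, bold));
            case XElement e when e.Name.LocalName == "b":
                return e.Nodes().SelectMany(n => ToRuns(n, italic, true));
            default:
                throw new ArgumentOutOfRangeException(nameof(node));
        }
    }

    private static Run CreateRun(string value, bool italic, bool bold)
    {
        var run = new Run();
        if (italic || bold)
        {
            var properties = new RunProperties();
            // The schema expects w:b before w:i inside w:rPr.
            if (bold) properties.AppendChild(new Bold());
            if (italic) properties.AppendChild(new Italic());
            run.AppendChild(properties);
        }

        // For brevity, whitespace is always preserved in this sketch.
        run.AppendChild(new Text(value) { Space = SpaceProcessingModeValues.Preserve });
        return run;
    }
}
```

With this shape, an input such as `<p>a <em>b <b>c</b></em></p>` would yield three runs: plain, italic, and bold italic. Further formatting elements would need an additional flag and `case` each, so a small formatting record might scale better than a growing list of `bool` parameters.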