XmlDocument: whitespace handling and normalization


I have a bunch of questions related to whitespace handling with XmlDocument. Please see the numbered comments in the example below.

  1. Shouldn't all whitespace be significant in mixed content? Why is the space between the a tags not significant?

  2. While I understand that the actual whitespace node is still an XmlWhitespace, how do I normalize these spaces into XmlSignificantWhitespace nodes? Normalize() doesn't work.

  3. Is my only option to do it manually?

Here's my test case:

// Requires: using System; using System.Linq; using System.Xml;
private static void Main()
{
    // 1. Shouldn't all whitespace be significant in mixed content? Why is the space between the a tags not significant?
    var doc = new XmlDocument
    {
        InnerXml = "<root>test1 <a>test2</a> <a>test3</a></root>",
    };
    PrintDoc(doc);

    // 2.a. While I understand that the actual whitespace node is still an XmlWhitespace, how do I normalize these spaces into XmlSignificantWhitespace nodes?
    doc.DocumentElement.RemoveAll();
    doc.DocumentElement.SetAttribute("xml:space", "preserve");
    var fragment = doc.CreateDocumentFragment();
    fragment.InnerXml = "test1 <a>test2</a> <a>test3</a>";
    doc.DocumentElement.PrependChild(fragment);
    PrintDoc(doc);

    // 2.b. Normalize doesn't work
    doc.Normalize();
    PrintDoc(doc);

    // 3.a. Manual normalization does work, is there a better way?
    doc.DocumentElement.RemoveAllAttributes();
    var whitespaces = doc.DocumentElement.ChildNodes.Cast<XmlNode>()
        .OfType<XmlWhitespace>()
        .ToList();
    foreach (var whitespace in whitespaces)
    {
        var significant = doc.CreateSignificantWhitespace(whitespace.Value);
        doc.DocumentElement.ReplaceChild(significant, whitespace);
    }
    PrintDoc(doc);

    // 3.b. Reading from string also works
    doc.InnerXml = "<root xml:space=\"preserve\">test1 <a>test2</a> <a>test3</a></root>";
    PrintDoc(doc);
}

private static void PrintDoc(XmlDocument doc)
{
    var nodes = doc.DocumentElement.ChildNodes.Cast<XmlNode>().ToList();
    var whitespace = nodes.OfType<XmlWhitespace>().Count();
    var significantWhitespace = nodes.OfType<XmlSignificantWhitespace>().Count();

    Console.WriteLine($"Xml: {doc.InnerXml}\nwhitespace: {whitespace}\nsignificant whitespace: {significantWhitespace}\n");
}

The output is as follows:

Xml: <root>test1 <a>test2</a><a>test3</a></root>
whitespace: 0
significant whitespace: 0

Xml: <root xml:space="preserve">test1 <a>test2</a> <a>test3</a></root>
whitespace: 1
significant whitespace: 0

Xml: <root xml:space="preserve">test1 <a>test2</a> <a>test3</a></root>
whitespace: 1
significant whitespace: 0

Xml: <root>test1 <a>test2</a> <a>test3</a></root>
whitespace: 0
significant whitespace: 1

Xml: <root xml:space="preserve">test1 <a>test2</a> <a>test3</a></root>
whitespace: 0
significant whitespace: 1

Solution

  • Deriving your own reader from XmlNodeReader seems to work, although it is not the cleanest solution.

    Consider the current implementation of XmlReader.MoveToContent():

    public virtual XmlNodeType MoveToContent() {
        do {
            switch (this.NodeType) {
                case XmlNodeType.Attribute:
                    MoveToElement();
                    goto case XmlNodeType.Element;
                case XmlNodeType.Element:
                case XmlNodeType.EndElement:
                case XmlNodeType.CDATA:
                case XmlNodeType.Text:
                case XmlNodeType.EntityReference:
                case XmlNodeType.EndEntity:
                    return this.NodeType;
            }
        } while (Read());
        return this.NodeType;
    }
    

    To have SignificantWhitespace treated as content, also return the NodeType when it is XmlNodeType.SignificantWhitespace.

    Here's the complete implementation of my own WhitespaceXmlNodeReader:

    internal class WhitespaceXmlNodeReader : XmlNodeReader
    {
        public WhitespaceXmlNodeReader(XmlNode node)
            : base(node)
        {
        }
    
        public override XmlNodeType MoveToContent()
        {
            do
            {
                switch (NodeType)
                {
                    case XmlNodeType.Attribute:
                        MoveToElement();
                        goto case XmlNodeType.Element;
                    case XmlNodeType.Element:
                    case XmlNodeType.EndElement:
                    case XmlNodeType.CDATA:
                    case XmlNodeType.Text:
                    case XmlNodeType.EntityReference:
                    case XmlNodeType.EndEntity:
                    // This was added:
                    case XmlNodeType.SignificantWhitespace:
                        return NodeType;
                }
            } while (Read());
            return NodeType;
        }
    }
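
A minimal usage sketch, to show where the overridden MoveToContent() is meant to differ from the stock one. It assumes XmlNodeReader reports the DOM node's own type for each child. The Demo class and the ContentAfterFirstChild helper are hypothetical names introduced only for this illustration, and the reader class is repeated so the sketch compiles on its own.

```csharp
using System;
using System.Xml;

// Same reader as above, repeated so this sketch is self-contained.
internal class WhitespaceXmlNodeReader : XmlNodeReader
{
    public WhitespaceXmlNodeReader(XmlNode node) : base(node) { }

    public override XmlNodeType MoveToContent()
    {
        do
        {
            switch (NodeType)
            {
                case XmlNodeType.Attribute:
                    MoveToElement();
                    goto case XmlNodeType.Element;
                case XmlNodeType.Element:
                case XmlNodeType.EndElement:
                case XmlNodeType.CDATA:
                case XmlNodeType.Text:
                case XmlNodeType.EntityReference:
                case XmlNodeType.EndEntity:
                case XmlNodeType.SignificantWhitespace:
                    return NodeType;
            }
        } while (Read());
        return NodeType;
    }
}

internal static class Demo
{
    // Hypothetical helper: reads up to the first end tag (</a>), steps onto
    // whatever node follows it, and reports where MoveToContent() stops.
    internal static XmlNodeType ContentAfterFirstChild(XmlReader reader)
    {
        while (reader.Read() && reader.NodeType != XmlNodeType.EndElement) { }
        reader.Read();
        return reader.MoveToContent();
    }

    private static void Main()
    {
        // xml:space="preserve" makes the loader keep the space between the
        // a elements as an XmlSignificantWhitespace node (see case 3.b).
        var doc = new XmlDocument();
        doc.LoadXml("<root xml:space=\"preserve\"><a>test2</a> <a>test3</a></root>");

        // Stock reader versus the subclass, starting from the same element.
        Console.WriteLine(ContentAfterFirstChild(new XmlNodeReader(doc.DocumentElement)));
        Console.WriteLine(ContentAfterFirstChild(new WhitespaceXmlNodeReader(doc.DocumentElement)));
    }
}
```

The stock reader is expected to skip the whitespace and stop on the second a element, while the subclass should stop on the whitespace node itself. Note that this only changes what the reader treats as content. It does not convert existing XmlWhitespace nodes in the document, so the manual replacement from step 3.a is still needed for those.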