
Why does XPath over a System.Xml.XmlDocument not find adjacent text and CData nodes?



var raw_xml = @"
<root>
    <test>
        <![CDATA[This is a CDATA node]]>And this is an adjacent text node
    </test>
</root>
";

var doc = new XmlDocument();
doc.LoadXml(raw_xml);

var results = doc.SelectNodes("/root/test/text()");
Console.WriteLine(results.Count); // gives: 1
Console.WriteLine(results[0].Value); // gives: This is a CDATA node
Console.WriteLine(results[0].Name); // gives: #cdata-section
Console.WriteLine(results[0].GetType().FullName); // gives: System.Xml.XmlCDataSection
Console.WriteLine(results[0].NextSibling.Name); // gives: #text
Console.WriteLine(results[0].NextSibling.Value.Trim()); // gives: And this is an adjacent text node

We can see from the above that the CDATA node has the text node as its next sibling, so you would expect the expression /root/test/text() to find it.

The results are the same with the XPath expression /root/test/node().


Solution

  • When working with XML documents, you are probably used to the Document Object Model (DOM), where CDATA nodes are separate from text nodes. The XPath data model, by contrast, treats each run of adjacent CDATA and text DOM siblings as a single text() node.

    Therefore, a query that tries to select a specific DOM text/CDATA node that is not the first of an adjacent series will fail, for example:

    var results = doc.SelectNodes("/root/test/text()[starts-with(., 'And')]");
    Console.WriteLine(results.Count); // gives: 0
    

    and indeed, trying to select the "text" DOM node by other XPath means:

    var results = doc.SelectNodes("/root/test/text()[contains(., 'text node')]");
    

    will give the same result as the initial /root/test/text() query in the question.

    What you are seeing is a mix of the two models: the result of the XPath query is translated back into a DOM node, so you get the first DOM node belonging to the matched text() node, which in this case is the CDATA node.
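
    If the source document cannot be changed, one option is to start from the DOM node that XPath hands back and walk its siblings to recover the rest of the run. The following is only a sketch of that idea: ExpandTextRun and IsTextLike are hypothetical helper names, not part of System.Xml, and the sketch assumes that a run ends at the first sibling that is not text, CDATA, or whitespace.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    static class TextRunDemo
    {
        // Hypothetical helper (not part of System.Xml): starting from the DOM node
        // that XPath returned, yield it and every directly following text-like
        // sibling, i.e. the DOM nodes that sit behind one XPath text() node.
        public static IEnumerable<XmlNode> ExpandTextRun(XmlNode first)
        {
            for (XmlNode node = first; node != null && IsTextLike(node); node = node.NextSibling)
                yield return node;
        }

        // Assumption: these four node types are the ones that form a text run.
        static bool IsTextLike(XmlNode node)
        {
            switch (node.NodeType)
            {
                case XmlNodeType.Text:
                case XmlNodeType.CDATA:
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    return true;
                default:
                    return false;
            }
        }

        static void Main()
        {
            var doc = new XmlDocument();
            doc.LoadXml("<root></root>");

            // XPath yields only the first DOM node of the run (the CDATA section).
            XmlNode first = doc.SelectSingleNode("/root/test/text()");

            // Walking the siblings recovers each DOM node of the run separately.
            foreach (XmlNode part in ExpandTextRun(first))
                Console.WriteLine($"{part.NodeType}: {part.Value}");
        }
    }
    ```

    With the document from the question, this should list the CDATA section first and the adjacent text node second. Any filtering such as starts-with can then be done in C# on each part.Value instead of in the XPath expression.
<test>
var testDoc = new XmlDocument();
testDoc.LoadXml("<root><test><![CDATA[a]]>b</test></root>");
var parts = new List<XmlNode>(TextRunDemo.ExpandTextRun(testDoc.SelectSingleNode("/root/test/text()")));
System.Diagnostics.Debug.Assert(parts.Count == 2);
System.Diagnostics.Debug.Assert(parts[0].NodeType == XmlNodeType.CDATA);
System.Diagnostics.Debug.Assert(parts[1].Value == "b");
</test>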

    If you really need to work with separate text and CDATA nodes in XPath, you will need to ensure that an XML comment separates the nodes in the source document, like this:

    <root>
        <test>
            <![CDATA[This is a CDATA node]]><!-- separator comment -->And this is an adjacent text node
        </test>
    </root>
    

    so that

    var results = doc.SelectNodes("/root/test/text()");
    Console.WriteLine(results.Count);
    

    will give 2.
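
    If editing the XML source by hand is not practical, the same separation could in principle be applied to the loaded DOM before querying. This is a sketch under assumptions rather than a recommended recipe: SeparateCData and IsCDataTextPair are hypothetical helper names, and it assumes a comment inserted through the DOM behaves the same as one parsed from the source text.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    static class SeparatorDemo
    {
        // Hypothetical helper: insert a comment between every CDATA section and
        // text node that sit directly next to each other under this element,
        // then repeat for all child elements.
        public static void SeparateCData(XmlElement element)
        {
            // Copy the children first, because the loop inserts new siblings.
            var children = new List<XmlNode>();
            foreach (XmlNode child in element.ChildNodes)
                children.Add(child);

            foreach (XmlNode child in children)
            {
                if (IsCDataTextPair(child, child.NextSibling))
                    element.InsertAfter(element.OwnerDocument.CreateComment(" separator comment "), child);
                else if (child is XmlElement childElement)
                    SeparateCData(childElement);
            }
        }

        // True for CDATA followed by text, or text followed by CDATA.
        static bool IsCDataTextPair(XmlNode a, XmlNode b)
        {
            return (a is XmlCDataSection && b is XmlText)
                || (a is XmlText && b is XmlCDataSection);
        }

        static void Main()
        {
            var doc = new XmlDocument();
            doc.LoadXml("<root></root>");

            SeparateCData(doc.DocumentElement);

            // Each DOM node should now be its own XPath text() node.
            foreach (XmlNode node in doc.SelectNodes("/root/test/text()"))
                Console.WriteLine($"{node.NodeType}: {node.Value}");
        }
    }
    ```

    Note that this modifies the in-memory document, so the comments will also appear if the document is saved afterwards.
<test>
var testDoc = new XmlDocument();
testDoc.LoadXml("<root><test><![CDATA[a]]>b</test></root>");
SeparatorDemo.SeparateCData(testDoc.DocumentElement);
System.Diagnostics.Debug.Assert(testDoc.DocumentElement.FirstChild.ChildNodes.Count == 3);
System.Diagnostics.Debug.Assert(testDoc.DocumentElement.FirstChild.ChildNodes[1].NodeType == XmlNodeType.Comment);
System.Diagnostics.Debug.Assert(testDoc.SelectNodes("/root/test/text()").Count == 2);
</test>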