
C# junk characters break XElement "pretty" representation


I have occasionally run across XML with some junk characters tossed in between the elements, which appears to be confusing whatever internal XNode/XElement method handles prettifying the Element.

The following...

var badNode = XElement.Parse(@"<b>+
  <inner1/>
  <inner2/>
</b>");

prints out

<b>+
  <inner1 /><inner2 /></b>

while this...

var badNode = XElement.Parse(@"<b>
  <inner1/>
  <inner2/>
</b>");

gives the expected

<b>
  <inner1 />
  <inner2 />
</b>

According to the debugger, the junk character gets parsed in as the XElement's "NextNode" property, which then apparently assigns the remaining XML as its "NextNode", causing the single line prettifying.

Is there any way to prevent/ignore this behavior, short of pre-screening the XML for any errant characters in between tag markers?


Solution

  • You are getting awkward indentation for badNode because, by adding the non-whitespace + character to the content of the <b> element, you have given the element mixed content, which the W3C defines as follows:

    3.2.2 Mixed Content

    [Definition: An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.]

    The presence of mixed content inside an element triggers special formatting rules for XmlWriter (which is used internally by XElement.ToString() to actually write itself to an XML string) that are explained in the documentation remarks for XmlWriterSettings.Indent:

    This property only applies to XmlWriter instances that output text content; otherwise, this setting is ignored.

    The elements are indented as long as the element does not contain mixed content. Once the WriteString or WriteWhitespace method is called to write out a mixed element content, the XmlWriter stops indenting. The indenting resumes once the mixed content element is closed.

    This explains the behavior you are seeing.
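
    To check whether a parsed element falls under this rule, you can look for text nodes among its direct children. The following is a minimal sketch, not part of any library: the names MixedContentDemo and HasMixedContent are made up for illustration, and the check assumes the XML was parsed without LoadOptions.PreserveWhitespace, so that whitespace-only text between tags has already been dropped.

    ```csharp
    using System;
    using System.Linq;
    using System.Xml.Linq;

    public static class MixedContentDemo
    {
        // Hypothetical helper: true when the element has child elements
        // and at least one text node among its direct children.
        public static bool HasMixedContent(XElement element) =>
            element.Elements().Any() && element.Nodes().OfType<XText>().Any();

        public static void Main()
        {
            var goodNode = XElement.Parse("<b>\n  <inner1/>\n  <inner2/>\n</b>");
            var badNode = XElement.Parse("<b>+\n  <inner1/>\n  <inner2/>\n</b>");

            Console.WriteLine(HasMixedContent(goodNode)); // False
            Console.WriteLine(HasMixedContent(badNode));  // True
        }
    }
    ```

    In badNode the stray + and the whitespace after it form a single XText node in front of <inner1>, and that node is what switches the writer out of indenting mode.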

    As a workaround, parsing your XML with LoadOptions.PreserveWhitespace, which preserves insignificant white space while parsing, might be what you want:

    var badNode = XElement.Parse(@"<b>+
      <inner1/>
      <inner2/>
    </b>", LoadOptions.PreserveWhitespace);
    Console.WriteLine(badNode);
    

    Which outputs:

    <b>+
      <inner1 />
      <inner2 />
    </b>
    

    Demo fiddle #1 here.
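
    The reason this works appears to be that the line breaks and indentation from the source are kept as text nodes and written back as they are, rather than the writer indenting anything itself. A rough sketch that makes the extra nodes visible (PreserveWhitespaceDemo is a name chosen here for illustration):

    ```csharp
    using System;
    using System.Linq;
    using System.Xml.Linq;

    public static class PreserveWhitespaceDemo
    {
        public static void Main()
        {
            const string xml = "<b>+\n  <inner1/>\n  <inner2/>\n</b>";

            // Default parsing drops whitespace-only text between tags.
            var defaultNode = XElement.Parse(xml);

            // PreserveWhitespace keeps that whitespace as XText nodes.
            var preservedNode = XElement.Parse(xml, LoadOptions.PreserveWhitespace);

            Console.WriteLine(defaultNode.Nodes().Count());   // 3
            Console.WriteLine(preservedNode.Nodes().Count()); // 5
        }
    }
    ```

    One consequence is that the output is only as tidy as the input: if the source XML is indented inconsistently, that layout should come back out unchanged.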

    Alternatively, if you are sure that badNode should not have character data, you could strip it manually after parsing:

    badNode.Nodes().OfType<XText>().Remove();
    

    Now badNode will no longer contain mixed content and XmlWriter will indent it nicely.

    Demo fiddle #2 here.
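
    If the junk characters can also appear deeper in the tree, and some elements carry legitimate text values, stripping every text node would go too far. One possible refinement, sketched below under the assumption that your XML never uses mixed content on purpose, is to remove text only from elements that also have child elements. StrayTextCleaner and StripStrayText are hypothetical names, and the method modifies the tree in place.

    ```csharp
    using System;
    using System.Linq;
    using System.Xml.Linq;

    public static class StrayTextCleaner
    {
        // Hypothetical helper: removes text nodes from every element that
        // also has child elements; text in leaf elements is left alone.
        public static XElement StripStrayText(XElement root)
        {
            // Materialize the matches first so the tree is not modified
            // while it is being enumerated.
            var strayText = root.DescendantsAndSelf()
                .Where(e => e.HasElements)
                .SelectMany(e => e.Nodes().OfType<XText>())
                .ToList();

            foreach (var text in strayText)
            {
                text.Remove();
            }

            return root;
        }

        public static void Main()
        {
            var badNode = XElement.Parse("<b>+\n  <inner1>keep me</inner1>\n  <inner2/>\n</b>");
            Console.WriteLine(StripStrayText(badNode));
        }
    }
    ```

    Which should output:

    <b>
      <inner1>keep me</inner1>
      <inner2 />
    </b>

    Be aware that this would also delete the text in intentionally mixed markup such as <p>Hello <i>world</i></p>, so it is only suitable for data-style XML.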