XmlTextWriter incorrectly writing control characters


.NET's XmlTextWriter creates invalid XML files.

In XML, some control characters are allowed, such as horizontal tab (&#x9;), but others are not, such as vertical tab (&#xB;). (See the XML spec.)

I have a string that contains a control character that is not allowed in XML.
XmlTextWriter escapes the character as a numeric character reference, but the resulting XML is of course still invalid.

How can I make sure that XmlTextWriter never produces an illegal XML file?

Or, if it's not possible to do this with XmlTextWriter, how can I strip the specific control characters that aren't allowed in XML from a string?

Example code:

using (XmlTextWriter writer =
  new XmlTextWriter("test.xml", Encoding.UTF8))
{
  writer.WriteStartDocument();
  writer.WriteStartElement("Test");
  writer.WriteValue("hello \xb world");
  writer.WriteEndElement();
  writer.WriteEndDocument();
}

Output:

<?xml version="1.0" encoding="utf-8"?><Test>hello &#xB; world</Test>

Solution

  • This behaviour is documented, although the note is tucked away in the documentation of the WriteString method. It reads as if it applies to the whole class:

    The default behavior of an XmlWriter created using Create is to throw an ArgumentException when attempting to write character values in the range 0x-0x1F (excluding white space characters 0x9, 0xA, and 0xD). These invalid XML characters can be written by creating the XmlWriter with the CheckCharacters property set to false. Doing so will result in the characters being replaced with numeric character entities (&#0; through &#0x1F). Additionally, an XmlTextWriter created with the new operator will replace the invalid characters with numeric character entities by default.

    So you end up writing invalid characters because you construct an XmlTextWriter directly with new. A better solution would be to create the writer with XmlWriter.Create instead, which by default throws an ArgumentException rather than emitting the illegal character.
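
As a rough sketch of both options raised in the question: the code below writes the same element through a writer obtained from XmlWriter.Create, which should reject the vertical tab, and then writes it again after removing the disallowed characters. The names XmlSafety, StripInvalidXmlChars and WriteTestElement are made up for this illustration and are not part of .NET. The filter relies on XmlConvert.IsXmlChar, which as far as I know is available from .NET Framework 4.0 onwards, and it assumes that silently dropping the offending characters is acceptable for your data.

```csharp
using System;
using System.Text;
using System.Xml;

static class XmlSafety
{
    // Hypothetical helper (not part of .NET): keeps only the characters
    // that XML 1.0 allows and drops everything else.
    public static string StripInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes a character above U+FFFF,
                // which XML allows; copy both halves.
                sb.Append(c).Append(text[++i]);
            }
            else if (XmlConvert.IsXmlChar(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Hypothetical helper: writes <Test>value</Test> using a writer from
    // XmlWriter.Create. XmlWriterSettings.CheckCharacters defaults to true,
    // so illegal characters cause an ArgumentException.
    public static string WriteTestElement(string value)
    {
        var output = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("Test");
            writer.WriteString(value);
            writer.WriteEndElement();
        }
        return output.ToString();
    }
}

static class Program
{
    static void Main()
    {
        string input = "hello \u000B world";

        try
        {
            XmlSafety.WriteTestElement(input);
        }
        catch (ArgumentException ex)
        {
            // The checking writer refuses the vertical tab.
            Console.WriteLine("Rejected: " + ex.Message);
        }

        // After filtering, the same text can be written without complaint.
        string cleaned = XmlSafety.StripInvalidXmlChars(input);
        Console.WriteLine(XmlSafety.WriteTestElement(cleaned));
    }
}
```

Whether to strip the characters or let the exception surface depends on the data: stripping loses information without telling anyone, while the exception at least makes the bad input visible at the point where it is written.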