I am writing a program that stores its data in an XML file. When people write XML by hand they can make subtle mistakes, like ending a comment with an extra - so that it looks like <!-- comment --->, or adding a </> inside an attribute. A human can still read such XML without trouble, but loading this text into XmlDocument throws a syntax error and nothing gets parsed.
Is there a way to make XmlDocument less strict, so that it ignores violations of the standard that do not make the document unparseable? For example, it's clear that <!-- comment ---> is still a comment, even though it contains an extra - at the end (which is against the specification).
No. XML parsers are expected to reject input that is not well-formed XML.
You can try your luck preprocessing the malformed files with Tidy, but it is better to make sure the input is well-formed in the first place.
Here's an example usage. Tidy will fix your comments and do some escaping, but an extra opening < will break things more often than not; guessing the intent in that case is simply too much to ask.
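using System.IO;
using System.Text;
using TidyNet; // assumption: the Tidy.NET port of HTML Tidy

// Configure Tidy to treat the input as XML and to repair malformed comments.
Tidy tidy = new Tidy();
tidy.Options.FixComments = true;
tidy.Options.XmlTags = true;
tidy.Options.XmlOut = true;

// Sample input: a stray '<' and a comment that ends with an extra '-'.
string invalid = "<root>< <!--comment--->></root>";
MemoryStream input = new MemoryStream(Encoding.UTF8.GetBytes(invalid));
MemoryStream output = new MemoryStream();

TidyMessageCollection messages = new TidyMessageCollection();
tidy.Parse(input, output, messages);
// TODO: inspect 'messages' for errors before trusting the output
string repaired = Encoding.UTF8.GetString(output.ToArray());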
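
Since Tidy's repair is best-effort, it may be worth checking that a string (the original input or the repaired output) is actually well-formed before relying on it. The following is a minimal sketch that uses only System.Xml; the class name XmlCheck and the method IsWellFormed are hypothetical names chosen for illustration, not part of any library.

```csharp
using System;
using System.Xml;

// Hypothetical helper, named for illustration only.
public static class XmlCheck
{
    // Returns true if 'xml' loads into an XmlDocument; otherwise returns
    // false and places the parser's message in 'error'.
    public static bool IsWellFormed(string xml, out string error)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            error = string.Empty;
            return true;
        }
        catch (XmlException ex)
        {
            // XmlException also exposes LineNumber and LinePosition,
            // which help point the user at the offending spot.
            error = ex.Message;
            return false;
        }
    }

    public static void Main()
    {
        string error;

        // A correctly terminated comment is accepted.
        Console.WriteLine(IsWellFormed("<root><!-- comment --></root>", out error)); // True

        // A comment ending in '--->' is rejected by the parser.
        Console.WriteLine(IsWellFormed("<root><!-- comment ---></root>", out error)); // False
        Console.WriteLine(error);
    }
}
```

Running the same check on the repaired string from the Tidy example would show whether the repair produced something XmlDocument accepts; if it did not, reporting the error back to the user is probably safer than guessing further.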