c# .net xml xmlwriter

Why does XmlWriter not always format the XML as specified in XmlWriterSettings?


BACKGROUND
I receive a lot of XML files that contain no newlines, and I use the function below to format them quickly.

SCENARIO
When I first run the tool on a file that contains no newlines (and no insignificant whitespace), it works as expected:

Convert("myfile.xml", "  ");

If I then run the tool again on the file I just formatted, this time to increase the indent, the indent doesn't change:

Convert("myfile.xml", "    ");

QUESTION
Why is the file not formatted the second time I run the function? How do I make sure the function always formats the file?

public static void Convert(string filename, string indent)
{
    var input_string = File.ReadAllText(filename, Encoding.UTF8);
    var settings = new XmlWriterSettings
    {
        NewLineHandling = NewLineHandling.Entitize,
        Indent = true,
        IndentChars = indent,
        NewLineChars = Environment.NewLine
    };
    var sb = new StringBuilder();
    using (var reader = XmlReader.Create(new StringReader(input_string)))
    using (var writer = XmlWriter.Create(sb, settings))
    {
        writer.WriteNode(reader, false);
        writer.Close();
    }
    File.Delete(filename);
    Encoding utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
    File.WriteAllText(filename, sb.ToString(), utf8);
}

NOTE
If I modify the reader to ignore whitespace then the writer can format the output correctly:

XmlReader.Create(new StringReader(input_string),
                 new XmlReaderSettings { IgnoreWhitespace = true })

But I still wonder why the writer fails to format the output when there is insignificant whitespace between the tags.


Solution

  • The problem is that when the reader preserves insignificant whitespace, it hands that whitespace to the writer as nodes in their own right, so as far as the writer is concerned it is now significant whitespace.

    The writer therefore can't add more whitespace, as that would change the meaning; or at least, it doesn't seem to check whether the text being written consists only of whitespace.

    So the correct thing to do is indeed to strip the whitespace first and let the writer regenerate it, using the reader settings you mention: new XmlReaderSettings { IgnoreWhitespace = true }

    On a side note, it is more efficient to pass the streams straight through rather than going via strings and StringBuilders. Because you are overwriting the input file, you first need to read the existing contents into a byte array:

    // Read the whole file up front, because the same path is reopened for writing below.
    var input = File.ReadAllBytes(filename);
    var settings = new XmlWriterSettings
    {
        NewLineHandling = NewLineHandling.Entitize,
        Indent = true,
        IndentChars = indent,
        NewLineChars = Environment.NewLine
    };
    // UTF-8 without a byte order mark, as in the original function.
    Encoding utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

    using (var mem = new MemoryStream(input))
    using (var sr = new StreamReader(mem, Encoding.UTF8))
    // IgnoreWhitespace drops the old indentation so the writer can apply its own.
    using (var reader = XmlReader.Create(sr, new XmlReaderSettings { IgnoreWhitespace = true }))
    using (var fs = File.Open(filename, FileMode.Create, FileAccess.Write, FileShare.None))
    using (var sw = new StreamWriter(fs, utf8))
    using (var writer = XmlWriter.Create(sw, settings))
    {
        writer.WriteNode(reader, false);
    }

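    Ideally, you should also write the preamble (the XML declaration) by calling this before WriteNode. Bear in mind that WriteNode already copies a declaration that is present in the input, so this matters mainly for files that lack one.

        writer.WriteStartDocument();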
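
To see the whitespace behavior described at the top of this answer in isolation, here is a minimal sketch that works on strings instead of files. The names ReindentDemo and Reindent are hypothetical, and NewLineChars = "\n" and OmitXmlDeclaration = true are assumptions chosen only to keep the output easy to compare; they are not part of the original function.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class ReindentDemo
{
    // Hypothetical helper: re-serializes the XML with the given indent string.
    // ignoreWhitespace controls whether the reader drops insignificant whitespace.
    public static string Reindent(string xml, string indent, bool ignoreWhitespace)
    {
        var writerSettings = new XmlWriterSettings
        {
            Indent = true,
            IndentChars = indent,
            NewLineChars = "\n",        // assumption: fixed newline for comparable output
            OmitXmlDeclaration = true   // assumption: keeps the demo output short
        };
        var readerSettings = new XmlReaderSettings { IgnoreWhitespace = ignoreWhitespace };

        var sb = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xml), readerSettings))
        using (var writer = XmlWriter.Create(sb, writerSettings))
        {
            // Copies every node from the reader, including any whitespace nodes it reports.
            writer.WriteNode(reader, false);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string compact = "<a><b>text</b></a>";

        // First pass: no whitespace in the input, so the writer indents freely.
        string two = Reindent(compact, "  ", ignoreWhitespace: false);

        // Second pass with whitespace preserved: the old indentation is copied through
        // and the writer adds none of its own, so the text comes back unchanged.
        string kept = Reindent(two, "    ", ignoreWhitespace: false);
        Console.WriteLine(kept == two);   // prints True

        // Second pass with whitespace ignored: the new indent is applied.
        string redone = Reindent(two, "    ", ignoreWhitespace: true);
        Console.WriteLine(redone == Reindent(compact, "    ", ignoreWhitespace: false));   // prints True
    }
}
```

The only difference between the two second passes is the IgnoreWhitespace flag on the reader, which is what decides whether the writer ever sees the old indentation.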