c# .net xml xmlreader xmlwriter

(C#) How to modify an attribute's value in an existing XML file without loading or rewriting the whole file?


I'm generating some huge XML files (several GB) with XmlWriter and LINQ to XML. The files have this form:

<Table recCount="" recLength="">
<Rec recId="1">..</Rec>
<Rec recId="2">..</Rec>
..
<Rec recId="n">..</Rec>
</Table>

I don't know the values of Table's recCount and recLength attributes until I have written all the inner Rec nodes, so I have to write these values at the very end.

Right now I write all the inner Rec nodes to a temp file, calculate the values of Table's attributes, and then write everything to the resulting file in the form shown above (copying all the Rec nodes from the temp file).

Is there a way to modify these attribute values without writing to another file (as I do now) or loading the whole document into memory (which is obviously not possible given the size of these files)?


Solution

  • Heavily commented code. The basic idea is that in the first pass we write:

    <?xml version="1.0" encoding="utf-8"?>
    <Table recCount="$1" recLength="$2">
    <!--Reserved space:++++++++++++++++-->
    <Rec...
    

    Then we go back to the beginning of the file and we rewrite the first three lines:

    <?xml version="1.0" encoding="utf-8"?>
    <Table recCount="1000" recLength="150">
    <!--Reserved space:#############-->
    

    The important "trick" here is that you can't "insert" into a file; you can only overwrite it. So we "reserve" some space for the digits (the Reserved space:############# comment). There are many ways we could have done it. For example, in the first pass we could have written:

    <Table recCount="              " recLength="          ">
    

    and then (xml-legal but ugly):

    <Table recCount="1000          " recLength="150       ">
    

    Or we could have appended the space after the > of Table:

    <Table recCount="" recLength="">                   
    

    (there are 20 spaces after the >)

    Then:

    <Table recCount="1000" recLength="150">            
    

    (now there are 13 spaces after the >)

    Or we could have simply added the spaces without the <!-- --> on a new line...
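
    As a rough illustration of the padded-attribute variant described above, here is a minimal sketch that is not part of the original answer. The file name, the `width` constant, and the `BuildHeader` helper are all assumptions made for this example, and it writes raw bytes instead of going through XmlWriter to keep things short:

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    // Hypothetical file name and field width, chosen only for this sketch.
    string path = Path.Combine(Path.GetTempPath(), "padded-demo.xml");
    const int width = 10; // int.MaxValue.ToString().Length

    // Builds the <Table> start tag with both values padded to a fixed width,
    // so the tag has the same length whatever the values are.
    string BuildHeader(string recCount, string recLength)
    {
        return "<Table recCount=\"" + recCount.PadRight(width) +
               "\" recLength=\"" + recLength.PadRight(width) + "\">";
    }

    var utf8NoBom = new UTF8Encoding(false);
    long lengthBefore;
    long lengthAfter;

    using (var fs = new FileStream(path, FileMode.Create, FileAccess.ReadWrite))
    {
        // First pass: space-only values, followed by the records.
        byte[] blank = utf8NoBom.GetBytes(BuildHeader("", ""));
        fs.Write(blank, 0, blank.Length);
        byte[] body = utf8NoBom.GetBytes(
            "\r\n<Rec recId=\"1\">a</Rec>\r\n<Rec recId=\"2\">bb</Rec>\r\n</Table>");
        fs.Write(body, 0, body.Length);
        lengthBefore = fs.Length;

        // Second pass: seek back to the start and overwrite the tag in place.
        byte[] filled = utf8NoBom.GetBytes(BuildHeader("2", "2"));
        fs.Position = 0;
        fs.Write(filled, 0, filled.Length);
        lengthAfter = fs.Length;
    }

    string text = File.ReadAllText(path);
    Console.WriteLine(lengthBefore == lengthAfter); // the file did not grow
    Console.WriteLine(text.Substring(0, text.IndexOf('>') + 1));
    ```

    Digits and spaces are single bytes in UTF-8, so the padded tag occupies the same number of bytes in both passes. Anything that reads the file later would have to tolerate the trailing spaces in the attribute values.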

    The code:

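    // Namespaces needed by this snippet (place them at the top of the file)
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using System.Xml;

    int maxRecCountLength = 10; // int.MaxValue.ToString().Length
    int maxRecLengthLength = 10; // int.MaxValue.ToString().Length
    int tokenLength = 4; // "$1".Length + "$2".Length; the tokens are written below
    // The reserved space is a run of '+' characters: ++++++++++++++++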
    
    string reservedSpace = new string('+', maxRecCountLength + maxRecLengthLength - tokenLength); 
    
    // You have to manually open the FileStream
    using (var fs = new FileStream("out.xml", FileMode.Create))
    
    // and add a StreamWriter on top of it
    using (var sw = new StreamWriter(fs, Encoding.UTF8, 4096, true))
    {
        // Here you write on your StreamWriter however you want.
        // Note that recCount and recLength have a placeholder $1 and $2.
        int recCount = 0;
        int maxRecLength = 0;
    
        using (var xw = XmlWriter.Create(sw))
        {
            xw.WriteWhitespace("\r\n");
            xw.WriteStartElement("Table");
            xw.WriteAttributeString("recCount", "$1");
            xw.WriteAttributeString("recLength", "$2");
    
            // You have to add some white space that will be 
            // partially replaced by the recCount and recLength value
            xw.WriteWhitespace("\r\n");
            xw.WriteComment("Reserved space:" + reservedSpace);
    
            // <--------- BEGIN YOUR CODE
            for (int i = 0; i < 100; i++)
            {
                xw.WriteWhitespace("\r\n");
                xw.WriteStartElement("Rec");
    
                string str = string.Format("Some number: {0}", i);
                if (str.Length > maxRecLength)
                {
                    maxRecLength = str.Length;
                }
                xw.WriteValue(str);
    
                recCount++;
    
                xw.WriteEndElement();
            }
            // <--------- END YOUR CODE
    
            xw.WriteWhitespace("\r\n");
            xw.WriteEndElement();
        }
    
        sw.Flush();
    
        // Now we read the first lines so we can modify them (normally we
        // read three lines: the XML declaration, the <Table element and
        // the <!--Reserved space: comment)
        fs.Position = 0;
    
        var lines = new List<string>();
    
        using (var sr = new StreamReader(fs, sw.Encoding, false, 4096, true))
        {
            while (true)
            {
                string str = sr.ReadLine();
                lines.Add(str);
    
                if (str.StartsWith("<Table"))
                {
                    // We read the next line, the comment line
                    str = sr.ReadLine();
                    lines.Add(str);
                    break;
                }
            }
        }
    
        string strCount = XmlConvert.ToString(recCount);
        string strMaxRecLength = XmlConvert.ToString(maxRecLength);
    
        // We do some replaces for the tokens
        int oldLen = lines[lines.Count - 2].Length;
        lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$1\"", string.Format("=\"{0}\"", strCount));
        lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$2\"", string.Format("=\"{0}\"", strMaxRecLength));
        int newLen = lines[lines.Count - 2].Length;
    
        // Shrink the reserved filler by the number of characters the <Table line grew
        lines[lines.Count - 1] = lines[lines.Count - 1].Replace(":" + reservedSpace, ":" + new string('#', reservedSpace.Length - newLen + oldLen));
    
        // We move back to just after the UTF8/UTF16 preamble
        fs.Position = sw.Encoding.GetPreamble().Length;
    
        // And we rewrite the lines
        foreach (string str in lines)
        {
            sw.Write(str);
            sw.Write("\r\n");
        }
    }
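
    The length bookkeeping in the second pass can be checked in isolation. The following is only a sketch: `FillHeader` is a hypothetical helper, not something from the code above, and the guard for a too-small filler is an addition that the original code does not have. It applies the same replacements and confirms that the two lines together keep their length:

    ```csharp
    using System;
    using System.Xml;

    // Hypothetical helper: fills in the two tokens and shrinks the filler
    // so that both lines together keep their original length.
    string[] FillHeader(string tableLine, string commentLine, string filler,
                        int recCount, int recLength)
    {
        string newTable = tableLine
            .Replace("=\"$1\"", "=\"" + XmlConvert.ToString(recCount) + "\"")
            .Replace("=\"$2\"", "=\"" + XmlConvert.ToString(recLength) + "\"");

        // How many characters the <Table line gained (negative if it shrank).
        int growth = newTable.Length - tableLine.Length;
        if (growth > filler.Length)
            throw new InvalidOperationException("Reserved space is too small.");

        string newComment = commentLine.Replace(
            ":" + filler, ":" + new string('#', filler.Length - growth));
        return new[] { newTable, newComment };
    }

    string reserved = new string('+', 16); // 10 + 10 - 4
    string table = "<Table recCount=\"$1\" recLength=\"$2\">";
    string comment = "<!--Reserved space:" + reserved + "-->";

    string[] result = FillHeader(table, comment, reserved, 1000, 150);
    Console.WriteLine(result[0]); // <Table recCount="1000" recLength="150">
    Console.WriteLine(result[1]); // the comment now holds 13 '#' characters
    ```

    With recCount = 1000 and recLength = 150 the `<Table` line grows by 3 characters (2 for `$1`, 1 for `$2`), so the filler shrinks from 16 to 13 and the rewritten header ends exactly where the old one did.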
    

    Slower .NET 3.5 way

    In .NET 3.5, StreamReader and StreamWriter always close the underlying FileStream when they are disposed, so the file has to be reopened several times. This is slightly slower.

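    // Namespaces needed by this snippet (place them at the top of the file)
    using System.Collections.Generic;
    using System.IO;
    using System.Xml;

    int maxRecCountLength = 10; // int.MaxValue.ToString().Length
    int maxRecLengthLength = 10; // int.MaxValue.ToString().Length
    int tokenLength = 4; // "$1".Length + "$2".Length; the tokens are written below
    // The reserved space is a run of '+' characters: ++++++++++++++++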
    
    string reservedSpace = new string('+', maxRecCountLength + maxRecLengthLength - tokenLength);
    string fileName = "out.xml";
    
    int recCount = 0;
    int maxRecLength = 0;
    
    using (var sw = new StreamWriter(fileName))
    {
        // Here you write on your StreamWriter however you want.
        // Note that recCount and recLength have a placeholder $1 and $2.
        using (var xw = XmlWriter.Create(sw))
        {
            xw.WriteWhitespace("\r\n");
            xw.WriteStartElement("Table");
            xw.WriteAttributeString("recCount", "$1");
            xw.WriteAttributeString("recLength", "$2");
    
            // You have to add some white space that will be 
            // partially replaced by the recCount and recLength value
            xw.WriteWhitespace("\r\n");
            xw.WriteComment("Reserved space:" + reservedSpace);
    
            // <--------- BEGIN YOUR CODE
            for (int i = 0; i < 100; i++)
            {
                xw.WriteWhitespace("\r\n");
                xw.WriteStartElement("Rec");
    
                string str = string.Format("Some number: {0}", i);
                if (str.Length > maxRecLength)
                {
                    maxRecLength = str.Length;
                }
                xw.WriteValue(str);
    
                recCount++;
    
                xw.WriteEndElement();
            }
            // <--------- END YOUR CODE
    
            xw.WriteWhitespace("\r\n");
            xw.WriteEndElement();
        }
    }
    
    var lines = new List<string>();
    
    using (var sr = new StreamReader(fileName))
    {
        // Now we read the first lines so we can modify them (normally we
        // read three lines: the XML declaration, the <Table element and
        // the <!--Reserved space: comment)
    
        while (true)
        {
            string str = sr.ReadLine();
            lines.Add(str);
    
            if (str.StartsWith("<Table"))
            {
                // We read the next line, the comment line
                str = sr.ReadLine();
                lines.Add(str);
                break;
            }
        }
    }
    
    // We open the file with File.OpenWrite and wrap the stream ourselves:
    // the StreamWriter(string) constructor would truncate the file, and we
    // want to overwrite only its first lines.
    using (var fs = File.OpenWrite(fileName))
    using (var sw = new StreamWriter(fs))
    {
        string strCount = XmlConvert.ToString(recCount);
        string strMaxRecLength = XmlConvert.ToString(maxRecLength);
    
        // We do some replaces for the tokens
        int oldLen = lines[lines.Count - 2].Length;
        lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$1\"", string.Format("=\"{0}\"", strCount));
        lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$2\"", string.Format("=\"{0}\"", strMaxRecLength));
        int newLen = lines[lines.Count - 2].Length;
    
        // Shrink the reserved filler by the number of characters the <Table line grew
        lines[lines.Count - 1] = lines[lines.Count - 1].Replace(":" + reservedSpace, ":" + new string('#', reservedSpace.Length - newLen + oldLen));
    
        // We move back to just after the UTF8/UTF16 preamble
        sw.BaseStream.Position = sw.Encoding.GetPreamble().Length;
    
        // And we rewrite the lines
        foreach (string str in lines)
        {
            sw.Write(str);
            sw.Write("\r\n");
        }
    }
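
    To check the result without loading the whole file, one option is to read just the root element with XmlReader, which streams the document. This is only a sketch: the sample file and its path are made up for the example and stand in for the real multi-gigabyte output.

    ```csharp
    using System;
    using System.IO;
    using System.Xml;

    // Hypothetical small sample standing in for the real output file.
    string path = Path.Combine(Path.GetTempPath(), "verify-demo.xml");
    File.WriteAllText(path,
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>\r\n" +
        "<Table recCount=\"2\" recLength=\"14\">\r\n" +
        "<!--Reserved space:#################-->\r\n" +
        "<Rec>Some number: 0</Rec>\r\n" +
        "<Rec>Some number: 1</Rec>\r\n" +
        "</Table>");

    string recCount;
    string recLength;

    using (XmlReader reader = XmlReader.Create(path))
    {
        // MoveToContent skips the XML declaration and the whitespace
        // and stops on the <Table> start tag.
        reader.MoveToContent();
        recCount = reader.GetAttribute("recCount");
        recLength = reader.GetAttribute("recLength");
    }

    Console.WriteLine("recCount=" + recCount + ", recLength=" + recLength);
    ```

    The reader is disposed right after the start tag, so none of the Rec elements are parsed.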