c# xml winforms file .net-6.0

How can I optimize my C# WinForms code for splitting large XML files into smaller ones efficiently?


I have a large XML file with 8 million nodes (around 1 GB). I am currently using XmlReader to read the file and XmlWriter to write the nodes into new XML files. I have noticed that the speed is very inconsistent: with 800,000 nodes per file it takes about 10 minutes, but with 80,000 nodes per file it takes 15-20 minutes, so for some reason it takes longer the more files it creates.

XML file:

<CNJExport>
<Item>
<ID>1</ID>
<name>Logitech MX Master 3 mouse</name>
<price>423.36</price>
</Item>
</CNJExport> 

Current code: this part is everything except the file creation and writing.

public string XML_file_path;
        
BackgroundWorker bw;

public Form1()
{
    InitializeComponent();
}


private void AppendText(string text)
{
    if (Progress_output_text.InvokeRequired)
    {
        Progress_output_text.Invoke(new Action<string>(AppendText), text);
    }
    else
    {
        Progress_output_text.AppendText(text);
    }
}


private async void button1_Click(object sender, EventArgs e)
{
    Generate_Button.Enabled = false;

    // Start the background worker
    bw = new BackgroundWorker();
    bw.DoWork += (obj, ea) => TasksAsync(1);
    bw.RunWorkerCompleted += (obj, ea) => Generate_Button.Enabled = true; // Enable the button after the operation completes
    bw.RunWorkerAsync();
}

private async void TasksAsync(int times)
{
    string Error_code_save = "", file_name_and_type = File_Path_Textbox.Text.Substring(File_Path_Textbox.Text.LastIndexOf('\\') + 1), full_file_path, file_name;
    int number_is_not_devisable = 0, total_items_in_XML = 0, Current_item_line = 0, filesCreated = 0, total_files_at_end;

    XML_file_path = File_Path_Textbox.Text;

    if (Number_Of_Elements_Per_file.Value == 0)
    {
        Error_code_save += "ERROR: Number of XML files cannot be zero\r";
    }
    else if (!string.IsNullOrEmpty(Error_code_save))
    {
        Progress_output_text.Invoke((MethodInvoker)delegate
        {
            Progress_output_text.Text += "You set " + Number_Of_Elements_Per_file.Value + " items per file\r";
        });
    }

    if (string.IsNullOrEmpty(File_Path_Textbox.Text) || string.IsNullOrEmpty(File_Destination_Textbox.Text))
    {
        Error_code_save += "ERROR: Path and/or Destination have not been set, please set them and generate again";
    }

    if (!string.IsNullOrEmpty(Error_code_save))
    {
        MessageBox.Show(Error_code_save);
        return;
    }

    file_name = file_name_and_type.Substring(0, file_name_and_type.Length - 4);

    using (XmlReader reader = XmlReader.Create(XML_file_path))
    {
        while (reader.Read())
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "Item")
                total_items_in_XML++;
        }
    }

    if (total_items_in_XML % Number_Of_Elements_Per_file.Value > 0)
    {
        number_is_not_devisable = 1;
    }

    total_files_at_end = (int)Math.Ceiling(total_items_in_XML / Number_Of_Elements_Per_file.Value);

    for (int i = 1; i <= total_files_at_end; i++)
    {
        int progressValue = (int)(i * 100.0 / total_files_at_end);

        progressBar1.Invoke((MethodInvoker)delegate
        {
            progressBar1.Value = progressValue;
        });

        full_file_path = string.Concat(File_Destination_Textbox.Text, '\\', file_name, i, ".xml");
        try
        {
            create_file(full_file_path, Current_item_line);

            filesCreated++;

            // Reset the progress bar after creating 10 files
            if (filesCreated % 10 == 0)
            {
                await Task.Delay(2000); // Wait for 2 seconds
                progressBar1.Invoke((MethodInvoker)delegate
                {
                    progressBar1.Value = 0;
                });
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("An error occurred while creating the file: " + ex.Message);
        }

        Current_item_line += (int)Number_Of_Elements_Per_file.Value;
    }

    Progress_output_text.Invoke((MethodInvoker)delegate
    {
        Progress_output_text.Text += filesCreated + " files have been created in " + File_Destination_Textbox.Text + "\r\n";
    });
}

This part is the actual create_file function:

public void create_file(string full_file_path, int Current_item_line)
{
    using (XmlWriter writer = XmlWriter.Create(full_file_path))
    {
        writer.WriteStartDocument();
        writer.WriteStartElement("CNJExport");

        using (XmlReader reader = XmlReader.Create(XML_file_path))
        {
            int itemCounter = 0;
            // Loop through the XML file and copy selected items to the new file
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Item")
                {
                    if (itemCounter >= Current_item_line && itemCounter < Current_item_line + Number_Of_Elements_Per_file.Value)
                    {
                        writer.WriteNode(reader, true);
                    }
                    itemCounter++;
                }
            }
        }

        writer.WriteEndElement();
        writer.WriteEndDocument();
    }
}

Solution

  • As noted in comments by Palle Due, you have two basic problems here:

    1. You are reading your input file once for each output file fragment.

      Instead, you should stream through your input file only once, and create output files dynamically as <Item> nodes are encountered.

    2. Your progress tracker includes a 2 second delay.

      You should eliminate that, and instead only update the progress tracker when necessary, e.g. when more than 2 seconds have elapsed since the previous update, or if more than 10% progress has been made.
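To make both problems concrete, here is a rough back-of-the-envelope sketch. It assumes the "8 million nodes" in the question are 8,000,000 <Item> elements; the class and method names are made up for illustration. It counts how many full passes over the input the question's code makes (one counting pass plus one pass per output file) and how many seconds are spent in Task.Delay(2000), which runs once per 10 files created.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class PassCount
{
    // Number of output files needed (ceiling division).
    public static long OutputFiles(long totalItems, long itemsPerFile) =>
        (totalItems + itemsPerFile - 1) / itemsPerFile;

    // Full reads of the input file in the question's code:
    // one pass to count items, plus one pass inside each create_file() call.
    public static long InputPasses(long totalItems, long itemsPerFile) =>
        1 + OutputFiles(totalItems, itemsPerFile);

    // Seconds spent waiting in Task.Delay(2000), triggered after every 10th file.
    public static long DelaySeconds(long totalItems, long itemsPerFile) =>
        OutputFiles(totalItems, itemsPerFile) / 10 * 2;
}
```

Under that assumption, 800,000 items per file gives 10 files and 11 passes over the 1 GB input, while 80,000 items per file gives 100 files and 101 passes, plus 20 seconds of deliberate delay. A single-pass splitter reads the input once regardless of the chunk size.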

    For issue #1, I rewrote your create_file() as follows, using PascalCase instead of snake_case as per C# naming conventions:

    public static class XmlExtensions
    {
        public static void SplitXmlFile(string inputFilePath, int maxItemsPerFile, XName rootName, XName itemName,
                                        Func<long, string> makeOutputFileFullPath, Action<long, long, long>? progressTracker, 
                                        FileMode fileMode = FileMode.CreateNew, 
                                        XmlReaderSettings? inputSettings = default, XmlWriterSettings? outputSettings = default) 
        {
            if (string.IsNullOrEmpty(inputFilePath) || maxItemsPerFile < 1 || rootName == null || itemName == null)
                throw new ArgumentException(); // TODO - throw more descriptive exceptions.
            
            void OpenOutput(out Stream outStream, out XmlWriter writer, ref long fileIndex)
            {
                var path = makeOutputFileFullPath(++fileIndex);
                outStream = new FileStream(path, fileMode);
                writer = XmlWriter.Create(outStream, outputSettings);
                writer.WriteStartElement(rootName.LocalName, rootName.NamespaceName);
            }
            void CloseOutput(ref Stream? outStream, ref XmlWriter? writer, long fileIndex, long streamPosition, long streamLength)
            {
                writer?.WriteEndElement();
                writer?.Dispose();
                outStream?.Dispose();
                (writer, outStream) = (null, null);
                // Inform the caller of the approximate progress by passing in the input stream length and position.  
                // Due to buffering, inStream.Position may be up to 4K bytes ahead of the actual reader position, 
                // but for UI progress tracking purposes this is probably fine.
                progressTracker?.Invoke(streamPosition, streamLength, fileIndex);
            }
            
            Stream? outStream = null;
            XmlWriter? writer = null;
            long fileIndex = 0;
            using (var inStream = File.OpenRead(inputFilePath))
            using (var reader = XmlReader.Create(inStream, inputSettings))
            {
                var streamLength = inStream.Length;
                try
                {
                    uint currentCount = 0;
                    // Loop through the XML file and, for each maxItemsPerFile chunk of items, create a new file and copy them into it.
                    while (reader.Read())
                    {
                        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == itemName.LocalName && reader.NamespaceURI == itemName.NamespaceName)
                        {
                            if (currentCount >= maxItemsPerFile)
                            {
                                CloseOutput(ref outStream, ref writer, fileIndex, inStream.Position, streamLength);
                                Debug.Assert(writer == null);
                            }
                            if (writer == null)
                            {
                                OpenOutput(out outStream, out writer, ref fileIndex);
                                currentCount = 0;
                            }
                            // ReadSubtree() ensures the reader is positioned at the EndElement node, not the next node
                            using (var subReader = reader.ReadSubtree())
                                writer.WriteNode(subReader, true);      
                            currentCount++;
                        }
                    }
                }
                finally
                {
                    CloseOutput(ref outStream, ref writer, fileIndex, streamLength, streamLength);
                }
            }
        }
    }
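The ReadSubtree() call in the loop above matters because XmlWriter.WriteNode(reader, true) leaves the reader positioned on the node after the copied element. If the loop then calls reader.Read() again, an adjacent <Item> with no whitespace before it can be skipped. The following minimal sketch (the WriteNodeDemo class and CopyItems method are hypothetical names, not part of the answer) counts how many items are copied with and without ReadSubtree():

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical demo class, for illustration only.
public static class WriteNodeDemo
{
    // Copies every <Item> the loop lands on into an in-memory document
    // and returns the number of items copied.
    public static int CopyItems(string xml, bool useSubtree)
    {
        int copied = 0;
        var sb = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(sb))
        {
            writer.WriteStartElement("root");
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || reader.Name != "Item")
                    continue;
                if (useSubtree)
                {
                    // The parent reader ends up on the </Item> end tag,
                    // so the next Read() moves to the following node.
                    using (var sub = reader.ReadSubtree())
                        writer.WriteNode(sub, true);
                }
                else
                {
                    // The reader ends up on the node AFTER </Item>,
                    // so the next Read() steps past it.
                    writer.WriteNode(reader, true);
                }
                copied++;
            }
            writer.WriteEndElement();
        }
        return copied;
    }
}
```

For the input `<root><Item>1</Item><Item>2</Item><Item>3</Item></root>`, the direct WriteNode() variant copies two items (the second is stepped over), while the ReadSubtree() variant copies all three. With indented input the whitespace node between items usually hides the problem, which is why it is easy to miss.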
    

    And then modify your button1_Click() to look something like:

    private async void button1_Click(object sender, EventArgs e)
    {
        await SplitSelectedFile();
    }
    
    private async Task SplitSelectedFile()
    {
        // Collect information from the GUI on the main thread.
        string inputFilePath = File_Path_Textbox.Text;
        string outputFilePrefix = Path.GetFileNameWithoutExtension(File_Path_Textbox.Text);
        string outputFileDestination = File_Destination_Textbox.Text;
        int maxItemsPerFile = (int)Number_Of_Elements_Per_file.Value;
    
        // Disable the Generate_Button while processing
        Generate_Button.Enabled = false; 
    
        List<string> outputFiles = new();
    
        // Split on the background thread
        Action doSplit = () =>
        {
            // TODO: Error handling in the event that the input file is missing or malformed, or we run out of disk space while writing the output files.
            // For instance, if the input file is malformed, you might want to delete all the output files.
            
            Console.WriteLine("Started");
            Func<long, string> makeOutputFileFullPath = (i) =>
            {
                var path = Path.Combine(outputFileDestination, string.Concat(outputFilePrefix, i, ".xml"));
                outputFiles.Add(path);
                return path;
            };
    
            int lastPercentDone = -1;
            DateTime lastDateTime = default;
            Action<long, long, long>? progressTracker = (position, length, fileNumber) =>
            {
                var percentDone = (int)((double)position / (double)length * 100);
                if (percentDone != lastPercentDone)
                {
                    var dateTime = DateTime.UtcNow;
                    // Update the progress bar if two seconds have passed or the percentage has changed by 10%.
                    if ((dateTime - lastDateTime).TotalSeconds > 2.0 || percentDone > lastPercentDone + 10)
                    {
                        progressBar1.InvokeIfRequired(() => progressBar1.Value = percentDone);
                        lastDateTime = dateTime;
                        lastPercentDone = percentDone;
                    }
                }
            };
    
            // Force the output to be indented, or not, as preferred.
            XmlReaderSettings inputSettings = new() { IgnoreWhitespace = true };
            XmlWriterSettings outputSettings = new() { Indent = false };
    
            XmlExtensions.SplitXmlFile(inputFilePath, maxItemsPerFile, "CNJExport", "Item", 
                                       makeOutputFileFullPath, progressTracker, 
                                       inputSettings : inputSettings, outputSettings : outputSettings);
        };
    
        // Update the UI after split on the main thread.
        Action<Task> onCompleted = (_) => 
        {
            // Re-enable the Generate_Button when processing is complete.
            Generate_Button.InvokeIfRequired(() => Generate_Button.Enabled = true);
            Progress_output_text.InvokeIfRequired(
                () =>
                {
                    Progress_output_text.Text += outputFiles.Count + " files have been created in " + outputFileDestination + "\r\n";
                });
            
            // If required, loop through the created files and do something
            foreach (var file in outputFiles)
            {
                // Add the file to some master list of files, show it in the UI, etc etc.
            }
        };
        
        await Task
            .Run(doSplit)
            .ContinueWith(onCompleted, TaskScheduler.FromCurrentSynchronizationContext());
    }
        
    

    Using the following extension method to simplify cross-thread Control invocation:

    public static class ControlExtensions
    {
        public static void InvokeIfRequired(this Control control, MethodInvoker invoker)
        {
            if (control.InvokeRequired)
                control.Invoke(invoker);
            else
                invoker();
        }
    }
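If you want to unit test the throttling rule used in progressTracker without a form, one option is to pull it into a small class that takes the current time as a parameter. This is only a sketch: ProgressThrottle and ShouldReport are hypothetical names, and the rule mirrors the lambda above (report only when the percentage changed and either the minimum interval has elapsed or the percentage advanced by more than the step).

```csharp
using System;

// Hypothetical helper, for illustration only.
public sealed class ProgressThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly int _percentStep;
    private DateTime _lastReportUtc = DateTime.MinValue;
    private int _lastPercent = -1;

    public ProgressThrottle(TimeSpan minInterval, int percentStep)
    {
        _minInterval = minInterval;
        _percentStep = percentStep;
    }

    // Returns true, and remembers the update, when the UI should be refreshed.
    public bool ShouldReport(int percent, DateTime nowUtc)
    {
        if (percent == _lastPercent)
            return false;

        bool due = nowUtc - _lastReportUtc > _minInterval
                   || percent > _lastPercent + _percentStep;
        if (!due)
            return false;

        _lastReportUtc = nowUtc;
        _lastPercent = percent;
        return true;
    }
}
```

Inside the progress callback this would be used as `if (throttle.ShouldReport(percentDone, DateTime.UtcNow)) progressBar1.InvokeIfRequired(() => progressBar1.Value = percentDone);`, with `throttle` created once as `new ProgressThrottle(TimeSpan.FromSeconds(2), 10)`.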
    

    Notes:

    • To simplify progress tracking, I track the percentage of the input stream read. Due to XmlReader buffering this may be wrong by up to 4K bytes, but since your files are 1GB in size this should be more than accurate enough.

    • My version of SplitXmlFile() does not require you to precompute the total number of output files beforehand, and does not require that the maximum number of items per output file evenly divide the number of items.

    • You may want to improve exception handling in the event that the input file is malformed, or the output files cannot be written.

    • I passed FileMode.CreateNew to avoid overwriting any preexisting split fragments. Pass FileMode.Create if you want them to be overwritten.

    • I haven't done asynchronous WinForms programming in .NET 6 so there might be some mistake with my use of asynchrony and threading.

    • Rather than manually extracting and combining file names using string methods, use the utilities from the Path class.

    • Consider rewriting your code to use C#'s standard naming conventions.
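As a small illustration of the Path note above, the sketch below builds an output file name the same way makeOutputFileFullPath does, but as a standalone method. OutputNaming and MakeOutputPath are hypothetical names; only Path.GetFileNameWithoutExtension() and Path.Combine() come from the framework.

```csharp
using System;
using System.IO;

// Hypothetical helper, for illustration only.
public static class OutputNaming
{
    // Builds "<destination>/<input name without extension><index>.xml".
    public static string MakeOutputPath(string inputFilePath, string destinationDirectory, long index)
    {
        // Works for any extension length, unlike Substring(0, Length - 4).
        string prefix = Path.GetFileNameWithoutExtension(inputFilePath);

        // Path.Combine inserts the platform's directory separator as needed.
        return Path.Combine(destinationDirectory, $"{prefix}{index}.xml");
    }
}
```

For example, `OutputNaming.MakeOutputPath("products.xml", "out", 3)` yields a path whose file name is `products3.xml`, without any manual searching for '\\' or trimming a fixed number of characters.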

    Demo fiddle here.