Tags: c#, xml, performance, linq-to-xml, memory-consumption

Reducing memory and increasing speed while parsing XML files


I have a directory with about 30 randomly named XML files, so the names give no clue about their content. I need to merge all of these files into a single file according to predefined rules, and unfortunately the rules are too complex for simple stylesheets.
Each file can have up to 15 different elements within its root, so I have 15 different methods that each take an XDocument as a parameter, search for a specific element in the XML, and then process that data. Because I call these methods in a fixed order, I can ensure that all data is processed in the correct order.
Example nodes include a list of products, a list of prices for specific product codes, a list of translations for product names, a list of countries, a list of discounts on products in specific countries, and much, much more. And no, these aren't very simple structures either.

Right now, I'm doing something like this:

List<XDocument> files = ImportFolder
    .EnumerateFiles("*.xml", SearchOption.TopDirectoryOnly)
    .Select(f => XDocument.Load(f.FullName))
    .ToList();

// Each parse method takes one XDocument and adds its results to MyXml.
files.ForEach(FileInformation);
files.ForEach(ParseComments);
files.ForEach(ParsePrintOptions);
files.ForEach(ParseTranslations);
files.ForEach(ParseProducts);
// etc.
MyXml.Save(ExportFile.FullName);

I wonder if I can do this in a way that keeps less in memory and generates the result faster. Speed is more important than memory, though. The current solution works; I just need something faster that uses less memory.
Any suggestions?


Solution

  • One approach would be to create a separate List<XElement> for each of the different data types. For example:

    List<XElement> Comments = new List<XElement>();
    List<XElement> Options = new List<XElement>();
    // etc.
    

    Then, for each document, you can go through its elements and add each one to the appropriate list. Or, in pseudocode:

    for each document
        for each element in document
            add element to the appropriate list
    

    This way you don't have to load all of the documents into memory at the same time. In addition, you only do a single pass over each document.
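
    A minimal sketch of that loop in C#, assuming the name of each child element under the root identifies its type (the Comment and Option names here are hypothetical). Cloning each element with new XElement(element) detaches the copy from its source document, so the document itself can be garbage-collected after the pass:

    var comments = new List<XElement>();
    var options = new List<XElement>();
    // etc.

    foreach (FileInfo file in ImportFolder.EnumerateFiles("*.xml", SearchOption.TopDirectoryOnly))
    {
        XDocument doc = XDocument.Load(file.FullName);
        foreach (XElement element in doc.Root.Elements())
        {
            switch (element.Name.LocalName)
            {
                case "Comment": comments.Add(new XElement(element)); break; // clone, so doc isn't kept alive
                case "Option":  options.Add(new XElement(element));  break;
                // etc.
            }
        }
        // doc goes out of scope here; only the cloned elements stay in memory
    }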

    Once you've read all of the documents, you can concatenate the different elements into your single MyXml document. That is:

    MyXml = create empty document
    Add Comments list to MyXml
    Add Options list to MyXml
    // etc.
    
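    In C#, that merge might look like this (the Catalog root name is hypothetical):

    MyXml = new XDocument(
        new XElement("Catalog",
            new XElement("Comments", comments),
            new XElement("Options", options)
            // etc.
        ));
    MyXml.Save(ExportFile.FullName);
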

    Another benefit of this approach is that if the total amount of data is larger than will fit in memory, those in-memory lists can be replaced by files: you'd write all of the Comment elements to a Comments file, the Options to an Options file, and so on. Once you've read all of the input documents and saved the individual elements to their files, you can read each element file back to create the final XML document.
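
    A hedged sketch of that variant for a single element type (the file name and element names are hypothetical); each element is streamed straight to disk instead of being kept in a list:

    // Spill one element type to an intermediate file during the input pass.
    using (XmlWriter commentWriter = XmlWriter.Create("Comments.xml"))
    {
        commentWriter.WriteStartElement("Comments");
        foreach (FileInfo file in ImportFolder.EnumerateFiles("*.xml", SearchOption.TopDirectoryOnly))
        {
            XDocument doc = XDocument.Load(file.FullName);
            foreach (XElement element in doc.Root.Elements("Comment"))
                element.WriteTo(commentWriter); // written to disk, never accumulated in memory
        }
        commentWriter.WriteEndElement();
    }
    // To build the final document, stream each per-type file back in,
    // e.g. with an XmlReader and XNode.ReadFrom, rather than loading it whole.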