Parsing an XML file that comes in as one object per line


I haven't been here in so long that I forgot my prior account! Anyway, I am working on parsing an XML document that comes in ugly. It contains banking statements, and each line is a complete <statement>all tags</statement> element. I need to read this file in and parse the XML, while also reformatting it to be more human readable.

Original input looks like this:

<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>
<statement><accountHeader><fiAddress></fiAddress><accountNumber></accountNumber><startDate>20140101</startDate><endDate>20140228</endDate><statementGroup>1</statementGroup><sortOption>0</sortOption><memberBranchCode>1</memberBranchCode><memberName></memberName><jointOwner1Name></jointOwner1Name><jointOwner2Name></jointOwner2Name></summary></statement>

I need the final output to be as follows:

<statement>
    <name></name>
    <address></address>
</statement>

That part is fine and dandy. I am using the following loop, which is very slow: with 5.1 million lines, a 254k data file, and about 60k statements, it takes around 8 minutes.

foreach(String item in lines)
{
    XElement xElement = XElement.Parse(item);
    sr.WriteLine(xElement.ToString().Trim());
}

Once the file is formatted, the part that sucks begins. I need to check every single tag in the transaction elements, and if a tag that could be there is missing, I have to fill it in. Our designer software defaults in prior values: if a tag is possible and the current object does not have it, the software uses the value from a prior object where it was not null. (I know, and they swear up and down it is not a bug... ok?)
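To make the fill-in step concrete, here is a minimal sketch. It assumes that "filling in" means adding an empty element for each missing tag, so the designer software has nothing to default; the question does not say what value goes in. `TagFiller`, `FillMissing`, and the tag names in the usage line are hypothetical, and the sketch appends missing tags at the end rather than preserving any particular order.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

static class TagFiller
{
    // Hypothetical helper: adds an empty element for every expected tag
    // that the given element does not already contain.
    // Assumption: an empty tag is what the downstream software needs.
    public static void FillMissing(XElement transaction, IEnumerable<string> expectedTags)
    {
        foreach (string tag in expectedTags)
        {
            if (transaction.Element(tag) == null)
            {
                // string.Empty content is written as <tag></tag> rather than <tag />.
                transaction.Add(new XElement(tag, string.Empty));
            }
        }
    }
}
```

It would be called once per transaction element, for example `TagFiller.FillMissing(tx, new[] { "amount", "memo" });`, with the real list of possible tags in place of the made-up names.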

So, that also takes about 5 to 10 minutes. I need to break all this down and find a faster method for working with the initial XML. This is a preprocessing step, and it cannot take that long if it does not have to. It just seems redundant.

Is there a better way to parse the XML, or is this the best I can do? I parse the XML, write it to a temp file, and then read that file back in and write the output file, inserting the missing tags. Two IO passes for one process. Yuck.


Solution

  • You can start by trying a modified loop that collects every statement under one root element and writes the result once, to see if that speeds it up for you:

    XElement root = new XElement("Statements");
    
    foreach(String item in lines)
    {
        XElement xElement = XElement.Parse(item);
        root.Add(xElement);
    }
    
    sr.WriteLine(root.ToString().Trim());
    

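    If that runs into memory issues, the following variant may help, though I'm not sure it will. It saves whatever has been parsed so far when memory runs out and then carries on, so if it works you'll get multiple XML files.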

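    // Requires: using System; using System.Linq; using System.Xml.Linq;
    int fileCount = 1;
    int count = 0;          // number of lines parsed and added so far
    XElement root = null;   // must be assigned before the lambda below can capture it
    Action Save = () => root.Save(string.Format("statements{0}.xml", fileCount++));

    while (count < lines.Length) // or lines.Count
    {
        try
        {
            root = new XElement("Statements");

            // Resume from the first line that has not been added yet.
            foreach (String item in lines.Skip(count))
            {
                XElement xElement = XElement.Parse(item);
                root.Add(xElement);
                count++;
            }
            Save();
        }
        catch (OutOfMemoryException)
        {
            // Save what was collected, release it, and start a new file on the next pass.
            Save();
            root = null;
            GC.Collect();
        }
    }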
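
    Another option is to avoid holding all statements in memory at all. The sketch below is not from the original answer: it parses one line at a time and writes each statement, indented, straight to an `XmlWriter`, so reading, fixing up, and formatting can happen in a single pass. `StatementFormatter` and `Format` are hypothetical names, the `Statements` wrapper element follows the code above, and each input line is assumed to be a well-formed `<statement>` element.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class StatementFormatter
{
    // Parses one <statement> per input line and writes it, indented, to the
    // output. Only the statement currently being processed is held in memory.
    public static void Format(IEnumerable<string> lines, TextWriter output)
    {
        var settings = new XmlWriterSettings
        {
            Indent = true,
            IndentChars = "    ",
            OmitXmlDeclaration = true
        };

        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            // Wrapper root so the result is a single well-formed document.
            writer.WriteStartElement("Statements");

            foreach (string line in lines)
            {
                if (string.IsNullOrWhiteSpace(line)) continue; // skip blank lines

                XElement statement = XElement.Parse(line);
                // Any per-statement fix-up (such as adding missing tags)
                // could be applied here, before the element is written.
                statement.WriteTo(writer);
            }

            writer.WriteEndElement();
        }
    }
}
```

    Paired with `File.ReadLines`, which enumerates the file lazily, a call might look like `using (var output = new StreamWriter("formatted.xml")) StatementFormatter.Format(File.ReadLines("statements.txt"), output);` where both file names are placeholders. Whether this is actually faster for your data is something to measure rather than assume.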