c#xml performance datacontractserializer

Filtering a big number of XmlNodes using DataContractSerializer is very slow

Considering the code below

List<CustomConversionData> Filter(XmlNodeList nodeList)
{
    var filteredResults= new List<CustomConversionData>();
    //Deserailze the data:
    foreach (XmlNode item in nodeList)
    {
      try
      {
        CustomConversionData obj = Deserialize<CustomConversionData>(item.ParentNode.OuterXml);
        filteredResults.Add(obj);

      }
      catch
      {
          try
          {                                            
              CustomConversionData obj = Deserialize<CustomConversionData>(item.OuterXml);
              filteredResults.Add(obj);
          }
          catch (Exception e) {
          }
        }
    }
   return filteredResults;
}

and the method that do the deserialization

public T Deserialize<T>(string rawXml)
{
    using (XmlReader reader = XmlReader.Create(new StringReader(rawXml)))
    {
        DataContractSerializer formatter =
            new DataContractSerializer(typeof(T));
        return (T)formatter.ReadObject(reader);
    }
}

When I run this for a nodeList which consists of 8000 nodes, it takes around 6 hours. I was looking for a way to cut down this time and in the beginning I thought maybe I can create a task for each iteration, but it became slower than before and I think it was because of the overhead of switching between the tasks.

I was wondering what would be the best way to improve the performance of this code as it seems to be very CPU and memory intensive?

Solution

In your Deserialize method, I would make formatter a static member as it will be the same every time (and I'm not sure it caches internally). Then use .ReadObject(new StringReader(rawXml)) to save the added complexity of introducing an XmlReader.

Then it gets tricky. Try not to use exception handling to control your logic, so do some other check first rather than letting it throw and catching it.

Finally, I think the biggest win would be to not take an XmlNodeList, but to take a Stream and create an XmlReader to scan the XML and only deserialize what it needs, when it needs it. The upfront cost of having all of those object graphs is going to add up fast.

Edit: Another suggestion, change the signature to IEnumerable<CustomConversionData> and yield return where you would do .Add(...), this way the consuming code can stream the results, keeping peak memory usage way down.

Edit2: selecting the ParentNode first each time strikes me as odd. If the XmlNodeList comes from a .ChildNodes call, then you'll be deserializing the same thing over and over. If it is from SelectNodes("...") then you might be able to be smarter about selecting the correct node with the XPath to begin with and not have to get the parent. If you still need to do it this way, create an XmlReader here, check the element name, and decide from that if you need the parent. If you have the correct element then you can pass the XmlReader into Derserialize, saving another copy.