Considering the code below
List<CustomConversionData> Filter(XmlNodeList nodeList)
{
var filteredResults= new List<CustomConversionData>();
//Deserailze the data:
foreach (XmlNode item in nodeList)
{
try
{
CustomConversionData obj = Deserialize<CustomConversionData>(item.ParentNode.OuterXml);
filteredResults.Add(obj);
}
catch
{
try
{
CustomConversionData obj = Deserialize<CustomConversionData>(item.OuterXml);
filteredResults.Add(obj);
}
catch (Exception e) {
}
}
}
return filteredResults;
}
and the method that do the deserialization
public T Deserialize<T>(string rawXml)
{
using (XmlReader reader = XmlReader.Create(new StringReader(rawXml)))
{
DataContractSerializer formatter =
new DataContractSerializer(typeof(T));
return (T)formatter.ReadObject(reader);
}
}
When I run this for a nodeList
which consists of 8000 nodes, it takes around 6 hours. I was looking for a way to cut down this time and in the beginning I thought maybe I can create a task for each iteration, but it became slower than before and I think it was because of the overhead of switching between the tasks.
I was wondering what would be the best way to improve the performance of this code as it seems to be very CPU and memory intensive?
In your Deserialize
method, I would make formatter
a static member as it will be the same every time (and I'm not sure it caches internally). Then use .ReadObject(new StringReader(rawXml))
to save the added complexity of introducing an XmlReader
.
Then it gets tricky. Try not to use exception handling to control your logic, so do some other check first rather than letting it throw and catching it.
Finally, I think the biggest win would be to not take an XmlNodeList
, but to take a Stream
and create an XmlReader
to scan the XML and only deserialize what it needs, when it needs it. The upfront cost of having all of those object graphs is going to add up fast.
Edit: Another suggestion, change the signature to IEnumerable<CustomConversionData>
and yield return
where you would do .Add(...)
, this way the consuming code can stream the results, keeping peak memory usage way down.
Edit2: selecting the ParentNode
first each time strikes me as odd. If the XmlNodeList
comes from a .ChildNodes
call, then you'll be deserializing the same thing over and over. If it is from SelectNodes("...")
then you might be able to be smarter about selecting the correct node with the XPath to begin with and not have to get the parent. If you still need to do it this way, create an XmlReader
here, check the element name, and decide from that if you need the parent. If you have the correct element then you can pass the XmlReader
into Derserialize
, saving another copy.