Search code examples
c#xmloptimizationdatatablexmlreader

Reading XML and creating a frequency count of elements in C#


I have an XML file in this format (but only much bigger) :

<customer>
    <name>John</name>
    <age>24</age>
    <gender>M</gender>
</customer>
<customer>
    <name>Keith</name>
    <age></age>         <!--blank value-->
    <gender>M</gender>
</customer>
<customer>
    <name>Jenny</name>
    <age>21</age>
    <gender>F</gender>
</customer>
<customer>
    <name>John</name>   
    <age>24</age>       <!--blank value-->
    <gender>M</gender>  <!--blank value-->
</customer>

I want to generate a DataTable which will be in this format :

Element NameValueFrequency

name       filled     4
name       blank    0

age          filled     2
age       blank    2

gender        filled    3
gender        blank    1

Currently I am completing this task in 2 parts, first creating a DataTable structure as above and setting all the frequencies to 0 as default. And then reading the XML using XmlReader, and increasing the count of the frequency everytime XmlReader finds a child element.

My problem is that the second function that I use for adding the actual count is taking too long for very big Xml files with many customers having many attributes. How can I improve the efficiency of this function?

My code :

static void AddCount(DataTable dt)
{
     int count;
     using (XmlReader reader = XmlReader.Create(@"C:\Usr\sample.xml"))
     {
         while (reader.Read())
            {
                if (reader.IsStartElement())
                {
                    string eleName = reader.Name;
                    DataRow[] foundElements = dt.Select("ElementName = '" + eleName + "'");  
                    if (!reader.IsEmptyElement)
                    {
                       count = int.Parse(foundElements.ElementAt(0)["Frequency"].ToString());  
                       foundElements.ElementAt(0).SetField("Frequency", count + 1);
                    }
                    else
                    {
                       count = int.Parse(foundElements.ElementAt(0)["Frequency"].ToString());  
                       foundElements.ElementAt(0).SetField("Frequency", count + 1);
                    }
                }
            }   
       }   
  }    

I am also ready to change the XmlReader class for any other more efficient class for this task. Any advice is welcome.


Solution

  • It turned out that querying in the DataTable using the Select operation was very expensive and that was making my function very slow.

    Instead of that, used a Dictionary<string, ValueFrequencyModel> and queried on that to fill the dictionary with the count, and after completing that, converted the Dictionary<string, ValueFrequencyModel> into a DataTable.

    This saved loads of time for me and solved the problem.