Search code examples
c#performancedictionaryoptimizationmemory-management

How do I improve the performance of this C# code - looping through a DataTable and building a Dictionary?


I am looping through a collection (DataTable) where each item has a "group number" and building a dictionary where each key corresponds to the group number. There's about a million records in the DataTable and I am running into memory issues when this runs. I was wondering if I could get some feedback on how to make this perform better. I think maybe one issue is where I am creating a new tempList on each iteration.

I really appreciate any feedback here.

Dictionary<int, IEnumerable<Item>> itemGroups = new Dictionary<int, IEnumerable<Item>>();
foreach (DataRow row in dtItems.Rows)
{
    Item item = new Item(row["ID"].ToString(),
        row["Name"].ToString(),
        row["Description"].ToString());

    Int32.TryParse(row["GroupNum"].ToString(), out int groupNum);
    if (!itemGroups.ContainsKey(groupNum))
    {
        List<Item> itemList = new List<Item>();
        itemList.Add(item);
        itemGroups.Add(groupNum, itemList);
    }
    else
    {
        var tempList = itemGroups[groupNum].ToList();
        tempList.Add(item);
        itemGroups[groupNum] = tempList;
    }
}

Solution

  • You can simplify it by using LINQ. The GroupBy extension method is already optimized for aggregating items.

    Dictionary<int, List<Item>> itemGroups = dtItems.Rows.Cast<DataRow>()
        .GroupBy(row => (int)row["GroupNum"])
        .ToDictionary(g => g.Key,
            g => g.Select(row => new Item(
                row["ID"].ToString(),
                row["Name"].ToString(),
                row["Description"].ToString())
            ).ToList());
    

    As other have already pointed out, don't convert values if you can cast them. If the GroupNum column contains an int, then simply cast the value to int. However, if a column is nullable, then it can contain a DBNull.Value. Let's assume that ID and Name are NOT NULL columns but that the Description is optional and therefore a NULL column. Then you could optimize the creation of an item like this:

    g => g.Select(row => new Item(
        (string)row["ID"],
        (string)row["Name"],
        row["Description"].ToString())
    

    Note also that an IEnumerable<T> queries the source every time it is enumerated. An IEnumerable<T> is not a collection, but a way to iterate over a data source. We could have dropped the .ToList() and typed the dictionary values as IEnumerable<Item> instead. But with the consequence that enumerating the dictionary values would have executed the whole Select including the new Item(...) every time.

    see also: Understanding C# LINQ’s deferred execution