Search code examples
c#datatableparallel-processingparallel.foreach

Parallel.ForEach and DataTable - Isn't DataTable.NewRow() a thread safe "read" operation?


I'm converting an existing application to take advantage of multiple processors. I have some nested loops, and I've converted the inner-most loop into a Parallel.Foreach loop. In the original application, inside the inner-most loop, the code would call DataTable.NewRow() to instantiate a new DataRow of the appropriate layout, populate the columns and add the populated DataRow into the DataTable with DataTable.Add(). But since DataTable is only thread-safe for read operations, I have converted the processing to add populated DataRow objects into a ConcurrentBag<DataRow> object. Then, once the Parallel.Foreach looping finishes, I iterate the ConcurrentBag and add the DataRow objects into the DataTable. It looks something like this...

DataTable MyDataTable = new DataTable()
// Add columns to the data table

For(int OuterLoop = 1; OuterLoop < MaxValue; OuterLoop++)
{
    //Do Stuff...

    ConcurrentBag<DataRow> CB = new ConcurrentBag<DataRow>();

    Parallel.Foreach(MyCollectionToEnumerate, x => 
    {
        //Do Stuff

        DataRow dr = MyDataTable.NewRow();
        // Populate dr...
        CB.Add(dr);
    {);

    ForEach(DataRow d in CB)
        MyDataTable.Add(d);
}

So when this runs, I'm seeing an, "Index was outside the bounds of the array." exception on the call to MyDataTable.NewRow(). But wouldn't NewRow() be a thread-safe, Read operation? Sure, it instantiates a new DataRow object, and that's not a read. But it doesn't need to modify the DataTable object, does it?

This might help a bit... When I look at the exception, the top two items on my call stack are...

   at System.Data.DataTable.NewRow(Int32 record)
   at System.Data.DataTable.NewRow()
   at ...

And I see that NewRow() calls what must be a private NewRow(int32) method. So maybe that's the issue. But I'm not sure how to solve it. If I have to, I can get creating and instead of instantiating the DataRow object from within my Parallel.Foreach loop, just instantiate a custom object that looks a lot like my DataTable and once the loop exits, instantiate the actual DataRows and add them to the DataTable. But that is less than elegant, and instantiates "unnecessary" objects. And my goal is to improve performance, so that seems counterproductive.

Thanks for any assistance.


Solution

  • No, NewRow is not a "read" operation and is not thread safe.

    Instead of using NewRow and populating the row you could just place your values in an array or list of object. Then when you've collected all of your data you can add it all to the DataTable.

    var newRow = table.NewRow();
    newRow.ItemArray = values; // array of values
    table.Rows.Add(newRow);
    

    That way the you can parallelize the creation of your data without running into issues when you add it to the DataTable.


    Looking at the source code for DataTable:

    It contains various fields:

    private readonly DataRowBuilder rowBuilder;
    internal readonly RecordManager recordManager;
    

    NewRow() calls NewRow(-1), and NewRow(int) modifies the state of those fields:

        internal DataRow NewRow(int record) {
            if (-1 == record) {
                record = NewRecord(-1);
            }
    
            rowBuilder._record = record;                  // here
            DataRow row = NewRowFromBuilder( rowBuilder );
            recordManager[record] = row;                  // here
    
            if (dataSet != null)
                DataSet.OnDataRowCreated( row );
    
            return row;
        }
    

    ...and there's much more that I haven't followed. But what's clear is that NewRow() does more than just return a new row - it modifies the state of the DataTable instance all over the place.

    The documentation never said it was thread safe, but I would have guessed that because you still have to add the row to the table, NewRow didn't modify the DataTable. But I would be wrong, and it's definitely not thread safe.

    Another flag is in the documentation for NewRow

    After creating a DataRow, you can add it to the DataRowCollection, through the DataTable object's Rows property. When you use NewRow to create new rows, the rows must be added to or deleted from the data table before you call Clear.

    It doesn't say what will happen if you call Clear() without adding or deleting a row created with NewRow(). An exception? Will I die? So I tried. I'm still here, but calling Clear() replaced all of the values in each row with DBNull.Value, further underscoring that the rows are not "disembodied" until they are added to the DataTable. They are part of its state.