Nested Parallel.For() loops and file creation problems


I've been investigating the TPL as a means of quickly generating a large volume of files. I have about 10 million rows in a database (events which belong to patients), each of which I want to write to its own text file at d:\EVENTS\PATIENTID\EVENTID.txt

I'm using two nested Parallel.ForEach loops: the outer one retrieves the list of patients, and the inner one retrieves the events for a patient and writes each event to a file.

This is the code I'm using. It's pretty rough at the moment, as I'm just trying to get things working.

DataSet1TableAdapters.GetPatientsTableAdapter ta = new DataSet1TableAdapters.GetPatientsTableAdapter();
List<DataSet1.GetPatientsRow> Pats = ta.GetData().ToList();

List<DataSet1.GetPatientEventsRow> events = null;

string patientDir = null;

System.IO.DirectoryInfo di = new DirectoryInfo(txtAllEventsPath.Text);
di.GetDirectories().AsParallel().ForAll((f) => f.Delete(true));

//get at the patients
Parallel.ForEach(Pats
        , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
        , patient =>
{
    patientDir = "D:\\Events\\" + patient.patientID.ToString();

    //Output directory
    Directory.CreateDirectory(patientDir);
    events = new DataSet1TableAdapters.GetPatientEventsTableAdapter().GetData(patient.patientID).ToList();


    if (Directory.Exists(patientDir))
    {
        Parallel.ForEach(events.AsEnumerable()
            , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
            , ev =>
            {
                List<DataSet1.GetAllEventRow> anEvent = 
                    new DataSet1TableAdapters.GetAllEventTableAdapter();    

                File.WriteAllText(patientDir + "\\" + ev.EventID.ToString() + ".txt", ev.EventData);
            });
    }

});

The code I have produced works very quickly but produces an error after a few seconds (in which about 6,000 files are produced). The error produced is one of two types:

DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.

Whenever this error is produced, the directory structure D:\Events\PATIENTID\ exists, as other files have been created within that directory. An if condition checks for the existence of D:\Events\PATIENTID\ before the second loop is entered.

The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.

When this error occurs, the indicated file sometimes exists and sometimes doesn't.

So, can anyone offer any advice as to why these errors are being produced? I don't understand either of them, and as far as I can see, it should just work (and indeed it does, for a short while).


Solution

  • From MSDN:

    Use the Parallel Loop pattern when you need to perform the same independent operation for each element of a collection or for a fixed number of iterations. The steps of a loop are independent if they don't write to memory locations or files that are read by other steps.

    Parallel.For can speed up the processing of your rows by running iterations on multiple threads, but it comes with a caveat: if it is not used correctly, the program behaves unexpectedly, as in the errors you describe above.

    The reason for the following error:

    DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.

    is most likely a race between threads: one thread tries to write a file into a directory that another thread has not created yet. Note that patientDir and events are declared outside the outer Parallel.ForEach lambda, so every iteration reads and overwrites the same two variables, and an iteration can end up using a path or an event list that another iteration has just assigned. Race conditions like this are the usual result of multi-threaded code that shares state without proper mechanisms such as locks or monitors.

    Because you are writing files, multiple threads that try to write to the same file at the same time end up with the second error, i.e.

    The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.

    One thread already has the file open for writing, so any other thread that tries to open it for writing at that moment fails.

    I would suggest using a normal loop instead of parallelism here.
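
If the outer loop is to stay parallel, one option is to remove the shared state rather than the parallelism. In the question's code, patientDir and events are declared outside the Parallel.ForEach lambda, so all iterations share them; the sketch below declares them inside the lambda instead. It is a minimal illustration under assumptions, not the asker's data layer: EventExporter, PatientEvent and the loadEvents callback are hypothetical stand-ins for the DataSet1 table adapters, and the inner loop is written as a plain foreach.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a row returned by GetPatientEventsTableAdapter.
public sealed class PatientEvent
{
    public PatientEvent(int eventId, string eventData)
    {
        EventId = eventId;
        EventData = eventData;
    }

    public int EventId { get; }
    public string EventData { get; }
}

public static class EventExporter
{
    // Writes <root>/<patientId>/<eventId>.txt for every event and returns
    // the number of files written. loadEvents is a hypothetical callback
    // standing in for the per-patient database query.
    public static int Export(
        string root,
        IEnumerable<int> patientIds,
        Func<int, IReadOnlyList<PatientEvent>> loadEvents)
    {
        int written = 0;

        Parallel.ForEach(
            patientIds,
            new ParallelOptions { MaxDegreeOfParallelism = 8 },
            patientId =>
            {
                // Declared inside the lambda: each iteration has its own
                // patientDir and events, so no other iteration can overwrite them.
                string patientDir = Path.Combine(root, patientId.ToString());
                Directory.CreateDirectory(patientDir);
                IReadOnlyList<PatientEvent> events = loadEvents(patientId);

                // Sequential inner loop: every file of this patient is written
                // by the same iteration that created the directory.
                foreach (PatientEvent ev in events)
                {
                    string file = Path.Combine(patientDir, ev.EventId + ".txt");
                    File.WriteAllText(file, ev.EventData);
                    Interlocked.Increment(ref written);
                }
            });

        return written;
    }
}
```

Assuming patient IDs are unique, and event IDs are unique within a patient, each path is written by exactly one iteration, and Directory.CreateDirectory has returned for that patient before the first write starts, so the Directory.Exists check is no longer needed. Whether an inner Parallel.ForEach would add throughput on top of this is not something the sketch measures.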