Search code examples
c#asynchronoustask-parallel-libraryhierarchical

Which Task<T> extension can I utilize for waiting on other tasks recursively?


I'm not really sure of an efficient way to do this. I have files, where the contents of the file point to other files, for example:

A
|-- B
|   |-- C
|   |-- D
|       |-- E
|
|-- F
    |-- C

G
|-- H
|   |-- I
|
|-- D
|   |-- E
|
|-- J

This goes on for hundreds of thousands upon thousands of files; luckily, the depth of dependencies is very shallow, but for arguments sake it is N-levels deep potentially, with no cyclical loops possible. My goal is to know the entire dependency of each file (flattened). For example:

  • A: (B, C, D, E, F) -- Note that 'C' is only listed once.
  • B: (C, D, E)
  • C: ()
  • D: (E)
  • E: ()
  • F: (C)
  • G: (D, E, H, I, J)
  • etc.

I first started by creating some model to keep track of this information:

public class FileData
{
    public string FilePath { get; set; }

    public ISet<FileInfo> DependentUpon { get; set; }
}

Of course, I then created a List<FileData> to store the processed files. Synchronously scanning the contents of the files to build up this dependency tree (and then flatten it) would just take too long, so I explored looking at async/await, which helped speed things up, but I am looking to make it even faster before unleashing it on the production environment.

My async/await attempt is much faster, but still is not efficient enough.

public async Task<ICollection<FileData>> ProcessAsync(IEnumerable<FileInfo> files)
{
    var mappings = new Dictionary<FileInfo, FileData>();

    foreach (var file in files)
    {
        // Static Method that constructs an instance of the class
        // and utilizes async I/O to read the file line-by-line
        // to build any first level dependencies.
        var info = await FileData.CreateAsync(file);

        // Update progress + Other Properties

        mappings.Add(file, info);
    }

    // Go through the list and recursively add to the dependencies
    foreach (var item in list)
    {
        foreach (var dependency in GetAllDependencies(item, mappings))
        {
           file.DependentUpon.Add(dependency);
        }
    }
}

IEnumerable<FileInfo> GetAllDependencies(FileData data, IDictionary<FileInfo, FileData> mappings)
{
    foreach (var file in info.DependentUpon)
    {
        yield return file;

        foreach (var child in GetAllDependencies(mappings[file], mappings))
        {
            yield return child;
        }
    }
}

This, of course, is nice an asynchronous somewhat, but it is still very synchronous and slow when I attempt to get the hierarchical structure (flattened). I am attempting to refactor the solution so it works much quicker taking advantage of async/await on the hierarchical search. So far, I only have the pseudo description, and I have no idea if this is possible or how to implement it properly:

Create a dictionary of FileInfo and Task<FileData> (so I'm no longer awaiting the construction of the class instances). After scanning the file for the first level DependentUpon, I find the matching tasks and only continue my current task when those tasks are completed. Of course, those tasks have the same instructions, so they will only be marked completed when their dependencies are completed. I want to start all tasks simultaneously. For example (just an example, I can't predict what task gets completed when):

  • Start Task A
  • Start Task B
  • Scan File A, DependentUpon (B, F)
  • Start Task C
  • Scan File B, DependentUpon (C, D)
  • Task A Wait until Task B and F are completed.
  • Start Task D
  • Scan File C
  • ...
  • Task D Wait until Task E is completed.
  • Scan File E, DependentUpon ()
  • Task E Completed
  • Task D Completed
  • Task C Completed
  • Task B Completed.
  • Start Task J
  • Task F Completed.
  • Task A Completed.
  • ...
  • All tasks completed

Solution

  • Consider using Task.WhenAll<> to concurrently wait on tasks loading (recursively) the root items. You can also defer the dependency list expansion, which reduces the running time for your function and uses memory more efficiently.

        public class FileData
        {
           public string FilePath { get; set; }
    
           public ISet<FileInfo> DependentUpon { get; set; }
    
           public IEnumerable<FileInfo> Dependencies {get; set;}
        }
    

    New property Dependencies provides a list of all dependent items. DependentUpon now contains only immediate dependencies and does not have to change.

        public async Task<ICollection<FileData>> ProcessAsync(IEnumerable<FileInfo> files)
        {
            var map = new Dictionary<FileInfo, Task<FileData>>();
            var tasks = files.Select(it => LoadFileDataAsync(it, map));
            return await Task.WhenAll(tasks);
        }
    
        static async Task<FileData> LoadFileDataAsync(FileInfo fileInfo, Dictionary<FileInfo, Task<FileData>> map)
        {
           // Load recursively FileData elements for all children 
           // storing the result in the map.
    
            Task<FileData> pendingTask;
            bool isNew;
    
            lock (map)
            {
                isNew = !map.TryGetValue(fileInfo, out pendingTask);
                if (isNew)
                {
                    pendingTask = FileData.CreateAsync(fileInfo);
                    map.Add(fileInfo, pendingTask);
                }
            }
    
            var data = await pendingTask;
            if (isNew)
            {
               // Assign an iterator traversing through the dependency graph
               // Note: parameters are captured by reference.
               data.Dependencies = ExpandDependencies(data.DependsUpon, map);
               if (data.DependsUpon.Count > 0)
               {
                  // Recursively load child items
                  var tasks = data.DependsUpon.Select(it => (Task)LoadFileDataAsync(it, map));
                  await Task.WhenAll(tasks);
               }
            }
            return data;
        }
    
    
        static IEnumerable<FileInfo> ExpandDependencies(ISet<FileInfo> directDependencies, Dictionary<FileInfo, Task<FileData>> map)
        {
    
            if (directDependencies.Count == 0)
            {
                yield break;
            }
    
            // Depth-first graph traversal
    
            var visited = new HashSet<FileInfo>(map.Comparer); // check for duplicates
            var stack = new Stack<FileInfo>();
    
            foreach(var item in directDependencies)
            {
                stack.Push(item);
            }
    
            while(stack.Count > 0)
            {
                var info = stack.Pop();
    
                if (visited.Add(info))
                {
                    yield return info;
    
                    var data = map[info].Result;
    
                    foreach (var child in data.DependsUpon)
                    {
                        stack.Push(child);
                    }
                }
            }
        }
    

    Working code samlple