How do I monitor progress in a TPL Dataflow mesh?

I'm working on a C# app with a time-consuming sequential workflow that must be performed asynchronously. It starts when the user presses a button and the app receives a few images captured from a camera within just a few milliseconds. The workflow then:

1. Saves the images to disk.
2. Aligns them.
3. Generates 3D data from them.
4. Groups them into a larger, collective object (called a "Scan").
5. Adds optional analysis data to this scan and executes it.
6. Finally, saves the scan itself to an XML file alongside the images.

Some of these steps are optional and configurable.

Since the processing can take so long, there will often be a queue of "scans" awaiting processing. So I need to present the user with a visual representation of the queue of captured scans and their current processing state (e.g. "Saving", "Analyzing", "Finished", etc.).

I've looked into using TPL Dataflow for this. But while the mesh is simple to create, I'm not getting just how I might monitor the status of what is going on so that I can update a user interface. Do I try to link custom action blocks that post back messages to the UI for that? Something else?

Is TPL Dataflow even the right tool for this job?

Solution

Reporting Overall Progress

When you consider that a TPL Dataflow graph has a first and a last block, and that you know how many items you posted into the graph, all you need to do is track how many messages have reached the final block and compare that with the number of messages posted into the head. That ratio is your overall progress.

This works trivially if the blocks are 1:1, that is, for every message in there is a single message out. If there is a one-to-many block, you will need to adjust your progress reporting accordingly.
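As a minimal sketch of the counting idea above, assuming a purely 1:1 graph: the final block increments a counter and reports the fraction completed through IProgress<double>. The class name OverallProgressDemo, the RunAsync method and the doubling step in the head block are hypothetical stand-ins for your real pipeline.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

public static class OverallProgressDemo
{
    // Posts `total` items through a two-block 1:1 pipeline and returns
    // the number of items that reached the final block.
    public static async Task<int> RunAsync(int total, IProgress<double> progress)
    {
        int completed = 0;

        // Head block: hypothetical stand-in for the real work.
        var head = new TransformBlock<int, int>(n => n * 2);

        // Final block: counts every message that reaches the end of the graph
        // and reports completed / total.
        var tail = new ActionBlock<int>(_ =>
        {
            int done = Interlocked.Increment(ref completed);
            progress.Report((double)done / total);
        });

        head.LinkTo(tail, new DataflowLinkOptions { PropagateCompletion = true });

        for (int i = 0; i < total; i++)
            head.Post(i);

        // Signal that no more items will arrive, then wait for the graph to drain.
        head.Complete();
        await tail.Completion;
        return completed;
    }
}
```

If you pass a Progress<double> created on the UI thread, its callback is invoked through the synchronization context captured at construction, which is usually what you want for updating a progress bar.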

Reporting Job Stage Progress

If you wish to present the progress of a job as it travels through the graph, you will need to pass job details to each block, not just the data needed by that block. A job is a single task that must span all of steps 1-6 listed in your question.

So, for example, step 2 may require image data in order to perform alignment, but it does not care about filenames, how many steps there are in the job, or anything else job-related. The block input alone carries too little detail to know the state of the current job, and it makes it difficult to look up the original job. You could refer to some external dictionary, but graphs are best designed when they are isolated and deal only with the data passed into each block.

So a simple example would be to change this minimal code from:

var alignmentBlock = new TransformBlock<Image, Image>(n => { ... });

...to:

var alignmentBlock = new TransformBlock<Job, Job>(job =>
{
     job.Stage = Stages.Aligning;

     // perform alignment here
     job.Aligned = ImageAligner.Align(job.Image, ...);

     // report progress
     job.Stage = Stages.AlignmentComplete;

     // hand the job on to the next block
     return job;
});

...and repeat the process for the other blocks.

The Stage property could fire a PropertyChanged notification or use any other notification pattern suitable for your UI.
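One way the notification could look, as a sketch only: a Job class that implements INotifyPropertyChanged and raises the event whenever Stage changes. Stages.Aligning and Stages.AlignmentComplete come from the example above; the other enum members are assumptions based on the states named in the question.

```csharp
using System.ComponentModel;
using System.Runtime.CompilerServices;

// Aligning and AlignmentComplete are used in the block above;
// the remaining members are assumed for illustration.
public enum Stages { Queued, Saving, Aligning, AlignmentComplete, Analyzing, Finished }

public class Job : INotifyPropertyChanged
{
    private Stages _stage = Stages.Queued;

    public event PropertyChangedEventHandler PropertyChanged;

    public Stages Stage
    {
        get => _stage;
        set
        {
            if (_stage == value) return;   // skip redundant notifications
            _stage = value;
            OnPropertyChanged();           // raises PropertyChanged("Stage")
        }
    }

    protected void OnPropertyChanged([CallerMemberName] string name = null) =>
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(name));
}
```

Note that the event is raised on whatever thread the block runs on, so depending on your UI framework you may need to marshal the update to the UI thread yourself.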

Notes

Now you will notice that I introduce a Job class that is passed as the only argument to each block. Job contains input data for the block as well as being a container for block output.

Now this will work, but the purist in me feels that it would be better to keep the job metadata separate from the TPL block input and output; otherwise there is potential for state damage from multiple threads.

To get around this you may want to consider using Tuple<> and passing that into the block.

e.g.

var alignmentBlock = new TransformBlock<Tuple<Job, UnalignedImages>, 
                                        Tuple<Job, AlignedImages>>(n => { ... });
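Filled in a little further, the tuple variant might look like the sketch below. Job, UnalignedImages and AlignedImages are hypothetical placeholder types here (the real ones would carry image data), and copying Count stands in for the actual alignment.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// Hypothetical placeholder types for illustration only.
public enum Stages { Queued, Aligning, AlignmentComplete }
public class Job { public Stages Stage = Stages.Queued; }
public class UnalignedImages { public int Count; }
public class AlignedImages { public int Count; }

public static class TupleBlockDemo
{
    public static TransformBlock<Tuple<Job, UnalignedImages>, Tuple<Job, AlignedImages>>
        CreateAlignmentBlock()
    {
        return new TransformBlock<Tuple<Job, UnalignedImages>, Tuple<Job, AlignedImages>>(input =>
        {
            Job job = input.Item1;                 // job metadata, used only for status
            UnalignedImages images = input.Item2;  // the block's actual input data

            job.Stage = Stages.Aligning;

            // Stand-in for the real alignment work.
            var aligned = new AlignedImages { Count = images.Count };

            job.Stage = Stages.AlignmentComplete;

            // Pass the same job along with this block's output.
            return Tuple.Create(job, aligned);
        });
    }
}
```

The job travels with the data but stays separate from it, so each block touches the job only to update its status.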