Given:
I want to output all the matches of a given literal in all the files across all the projects. I would like to use this example to understand how to optimize the performance of a simple TPL DataFlow pipeline.
The complete code is committed to GitHub - https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs
The pipeline itself is:
private void Run(string workspaceRoot, string literal, int maxDOP1 = 1, int maxDOP2 = 1)
{
    var projects = (workspaceRoot + "build\\projects.yml").YieldAllProjects();

    var produceCSFiles = new TransformManyBlock<ProjectEx, CSFile>(YieldCSFiles,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDOP1 });
    var produceMatchingLines = new TransformManyBlock<CSFile, MatchingLine>(csFile => csFile.YieldMatchingLines(literal),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDOP2 });
    var getMatchingLines = new ActionBlock<MatchingLine>(o => Console.WriteLine(o.ToString(workspaceRoot)));

    var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    produceCSFiles.LinkTo(produceMatchingLines, linkOptions);
    produceMatchingLines.LinkTo(getMatchingLines, linkOptions);

    Console.WriteLine($"Locating all the instances of {literal} in the C# code ... ");
    var sw = Stopwatch.StartNew();

    projects.ForEach(p => produceCSFiles.Post(p));
    produceCSFiles.Complete();
    getMatchingLines.Completion.Wait();

    sw.Stop();
    Console.WriteLine(sw.Elapsed);
}
Here are some notes:

- The ProjectEx objects are cheap to construct, but accessing ProjectEx.MSBuildProject for the very first time is quite expensive. This is where the Microsoft Build API evaluates the respective csproj file.
- I am not sure how to depict the pipeline graphically here, but:
  - produceCSFiles is fed cheap ProjectEx objects and outputs a lot of CSFile objects, which is expensive due to the project evaluation.
  - produceMatchingLines is fed CSFile objects and outputs the matching lines, which is expensive due to the sheer quantity of CSFile objects and the amount of lines to process.

My question - is my implementation optimal? I have doubts, because increasing maxDOP1 and maxDOP2 does not yield too much of an improvement:
C:\work\TPLDataFlow [master ≡ +0 ~2 -0 !]> 1..4 |% { $MaxDOP1 = $_ ; 1..4 } |% { $MaxDOP2 = $_ ; $res = .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\dayforce\tip -l GetClientLegalPromptFlag --maxDOP1 $MaxDOP1 --maxDOP2 $MaxDOP2 -q ; "$MaxDOP1 x $MaxDOP2 --> $res" }
1 x 1 --> Elapsed: 00:00:21.1683002
1 x 2 --> Elapsed: 00:00:19.8194133
1 x 3 --> Elapsed: 00:00:20.2626202
1 x 4 --> Elapsed: 00:00:20.4339065
2 x 1 --> Elapsed: 00:00:17.6475658
2 x 2 --> Elapsed: 00:00:15.4889941
2 x 3 --> Elapsed: 00:00:14.9014116
2 x 4 --> Elapsed: 00:00:14.9254166
3 x 1 --> Elapsed: 00:00:17.6474953
3 x 2 --> Elapsed: 00:00:14.4933295
3 x 3 --> Elapsed: 00:00:14.2419329
3 x 4 --> Elapsed: 00:00:14.1185203
4 x 1 --> Elapsed: 00:00:19.0717189
4 x 2 --> Elapsed: 00:00:15.9069517
4 x 3 --> Elapsed: 00:00:16.3267676
4 x 4 --> Elapsed: 00:00:17.0876474
C:\work\TPLDataFlow [master ≡ +0 ~2 -0 !]>
What I see is:

- maxDOP1 == 3 and maxDOP2 == 4 - 14.12 sec vs 21.17 sec
- maxDOP1 == 2 and maxDOP2 == 3 - 15 sec vs 21.17 sec

All in all, just a 30% improvement over the single-threaded version. This is a bit disappointing, because all the files are on the SSD and I have 12 logical processors. And, of course, the code is much more complicated.
Am I missing anything? Maybe I am not doing it in an optimal fashion?
This architecture is not optimal, because each of the worker blocks, produceCSFiles and produceMatchingLines, does a mix of I/O-bound and CPU-bound work. Ideally you would have one block dedicated exclusively to I/O-bound work, and another dedicated exclusively to CPU-bound work. That way you could optimally configure the degree of parallelism of each block according to the capabilities of the associated hardware component. With your current configuration it is entirely possible that at a given moment both blocks are doing I/O work, competing with each other for the SSD's attention while the CPU is idle, and at another moment the exact opposite could be happening. The result is a chaotic and uncoordinated hubbub, similar to what you would get from a monolithic Parallel.ForEach loop, which would probably yield comparable (mediocre) performance improvements over a single-threaded approach.
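To make the split concrete, here is a minimal sketch of what dedicated I/O and CPU blocks could look like. This is not the code from the question - the file reading and the line matching are simplified placeholders, and the literal is hard-coded as an example:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks.Dataflow;

string literal = "GetClientLegalPromptFlag"; // example literal from the question

// I/O-bound block: only reads the file contents; small DOP tuned for the SSD.
var readFiles = new TransformBlock<string, (string Path, string Text)>(
    async path => (path, await File.ReadAllTextAsync(path)),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

// CPU-bound block: only scans the text already in memory; DOP tuned for the cores.
var findMatches = new TransformManyBlock<(string Path, string Text), string>(
    file => file.Text.Split('\n')
        .Where(line => line.Contains(literal))
        .Select(line => $"{file.Path}: {line.Trim()}"),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

readFiles.LinkTo(findMatches, new DataflowLinkOptions { PropagateCompletion = true });
```

The point is that the MaxDegreeOfParallelism of each block can now be tuned independently: the I/O block for the storage device, and the CPU block for the number of cores.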
Something else to keep in mind is that TPL Dataflow performs well when the messages passed from block to block are chunky. As the introductory document says, it "provides in-process message passing for coarse-grained dataflow and pipelining tasks" (emphasis added). If the processing of each individual message is too lightweight, you'll end up with significant overhead. If needed, you can chunkify your workload by batching the messages, using a BatchBlock<T>, the Chunk LINQ operator, or other means.
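As a rough sketch of the batching idea (again, not the code from the question - the directory and the per-file work are stand-ins), the Chunk LINQ operator (.NET 6+) can group the file paths into arrays so that each dataflow message carries a coarse-grained unit of work instead of a single file:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks.Dataflow;

string root = @"C:\dayforce\tip"; // example directory from the question

// Each message is now an array of 100 paths, not a single path.
var processBatch = new ActionBlock<string[]>(batch =>
{
    foreach (var path in batch)
    {
        // ...search this file for the literal here...
    }
});

foreach (string[] batch in Directory
    .EnumerateFiles(root, "*.cs", SearchOption.AllDirectories)
    .Chunk(100))
{
    processBatch.Post(batch);
}

processBatch.Complete();
await processBatch.Completion;
```

Alternatively, a BatchBlock<string>(100) can be linked in front of the worker block to do the same grouping inside the pipeline itself.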
Having said all that, my assumption is that your workload is disproportionately I/O-bound, which makes the capabilities of your CPU less relevant. Honestly, I wouldn't expect massive performance improvements even with the most sophisticated implementation.