Given:
I want to output all the matches of a given literal in all the files across all the projects. I would like to use this example to understand how to optimize the performance of a simple TPL DataFlow pipeline.
The complete code is committed to GitHub - https://github.com/MarkKharitonov/LearningTPLDataFlow/blob/master/FindStringCmd.cs
The pipeline itself is:
private void Run(string workspaceRoot, string literal, int maxDOP1 = 1, int maxDOP2 = 1)
{
    var projects = (workspaceRoot + "build\\projects.yml").YieldAllProjects();

    var produceCSFiles = new TransformManyBlock<ProjectEx, CSFile>(YieldCSFiles,
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDOP1 });
    var produceMatchingLines = new TransformManyBlock<CSFile, MatchingLine>(csFile => csFile.YieldMatchingLines(literal),
        new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = maxDOP2 });
    var getMatchingLines = new ActionBlock<MatchingLine>(o => Console.WriteLine(o.ToString(workspaceRoot)));

    var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
    produceCSFiles.LinkTo(produceMatchingLines, linkOptions);
    produceMatchingLines.LinkTo(getMatchingLines, linkOptions);

    Console.WriteLine($"Locating all the instances of {literal} in the C# code ... ");
    var sw = Stopwatch.StartNew();

    projects.ForEach(p => produceCSFiles.Post(p));
    produceCSFiles.Complete();
    getMatchingLines.Completion.Wait();

    sw.Stop();
    Console.WriteLine(sw.Elapsed);
}
Here are some notes:

- The ProjectEx objects are cheap to construct, but accessing ProjectEx.MSBuildProject for the very first time is quite expensive. This is where the Microsoft Build API evaluates the respective csproj file.
- I am not sure how to depict the pipeline graphically here, but:
  - produceCSFiles is fed cheap ProjectEx objects and outputs a lot of CSFile objects, which is expensive due to the project evaluation.
  - produceMatchingLines is fed CSFile objects and outputs the matching lines, which is expensive due to the sheer quantity of CSFile objects and the amount of lines to process.

My question - is my implementation optimal? I have doubts, because increasing maxDOP1 and maxDOP2 does not yield too much of an improvement:
C:\work\TPLDataFlow [master ≡ +0 ~2 -0 !]> 1..4 |% { $MaxDOP1 = $_ ; 1..4 } |% { $MaxDOP2 = $_ ; $res = .\bin\Debug\net5.0\TPLDataFlow.exe find-string -d C:\dayforce\tip -l GetClientLegalPromptFlag --maxDOP1 $MaxDOP1 --maxDOP2 $MaxDOP2 -q ; "$MaxDOP1 x $MaxDOP2 --> $res" }
1 x 1 --> Elapsed: 00:00:21.1683002
1 x 2 --> Elapsed: 00:00:19.8194133
1 x 3 --> Elapsed: 00:00:20.2626202
1 x 4 --> Elapsed: 00:00:20.4339065
2 x 1 --> Elapsed: 00:00:17.6475658
2 x 2 --> Elapsed: 00:00:15.4889941
2 x 3 --> Elapsed: 00:00:14.9014116
2 x 4 --> Elapsed: 00:00:14.9254166
3 x 1 --> Elapsed: 00:00:17.6474953
3 x 2 --> Elapsed: 00:00:14.4933295
3 x 3 --> Elapsed: 00:00:14.2419329
3 x 4 --> Elapsed: 00:00:14.1185203
4 x 1 --> Elapsed: 00:00:19.0717189
4 x 2 --> Elapsed: 00:00:15.9069517
4 x 3 --> Elapsed: 00:00:16.3267676
4 x 4 --> Elapsed: 00:00:17.0876474
C:\work\TPLDataFlow [master ≡ +0 ~2 -0 !]>
What I see is:

- maxDOP1 == 3 and maxDOP2 == 4 - 14.12 sec vs 21.17 sec
- maxDOP1 == 2 and maxDOP2 == 3 - 15 sec vs 21.17 sec

All in all, just a 30% improvement over the single-threaded version. This is a bit disappointing, because all the files are on the SSD and I have 12 logical processors. And, of course, the code is much more complicated.
Am I missing anything? Maybe I am not doing it in an optimal fashion?
This architecture is not optimal, because each of the worker blocks, produceCSFiles and produceMatchingLines, does a mix of I/O-bound and CPU-bound work. Ideally you would have one block dedicated exclusively to I/O-bound work, and another dedicated exclusively to CPU-bound work. That way you could optimally configure the degree of parallelism of each block according to the capabilities of the associated hardware component. With your current configuration it is entirely possible that at a given moment both blocks are doing I/O work, competing with each other for the SSD's attention while the CPU is idle, and at another moment the exact opposite could be happening. The result is a chaotic and uncoordinated hubbub, similar to what you would get from a monolithic Parallel.ForEach loop, which would probably yield comparable (mediocre) performance improvements over a single-threaded approach.
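To make the split concrete, here is a minimal sketch of what dedicated I/O and CPU blocks could look like. This is not the code from the question - the file reading and the line matching are simplified placeholders, and the literal is hard-coded as an example:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks.Dataflow;

string literal = "GetClientLegalPromptFlag"; // example literal from the question

// I/O-bound block: only reads the file contents; small DOP tuned for the SSD.
var readFiles = new TransformBlock<string, (string Path, string Text)>(
    async path => (path, await File.ReadAllTextAsync(path)),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2 });

// CPU-bound block: only scans the text already in memory; DOP tuned for the cores.
var findMatches = new TransformManyBlock<(string Path, string Text), string>(
    file => file.Text.Split('\n')
        .Where(line => line.Contains(literal))
        .Select(line => $"{file.Path}: {line.Trim()}"),
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

readFiles.LinkTo(findMatches, new DataflowLinkOptions { PropagateCompletion = true });
```

The point is that the MaxDegreeOfParallelism of each block can now be tuned independently: the I/O block for the storage device, and the CPU block for the number of cores.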
Something else to keep in mind is that TPL Dataflow performs well when the messages passed from block to block are chunky. As the introductory document says, it "provides in-process message passing for coarse-grained dataflow and pipelining tasks" (emphasis added). If the processing of each individual message is too lightweight, you'll end up with significant overhead. If needed, you can chunkify your workload by batching the messages, using a BatchBlock<T>, the Chunk LINQ operator, or other means.
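As a rough sketch of the batching idea (again, not the code from the question - the directory and the per-file work are stand-ins), the Chunk LINQ operator (.NET 6+) can group the file paths into arrays so that each dataflow message carries a coarse-grained unit of work instead of a single file:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks.Dataflow;

string root = @"C:\dayforce\tip"; // example directory from the question

// Each message is now an array of 100 paths, not a single path.
var processBatch = new ActionBlock<string[]>(batch =>
{
    foreach (var path in batch)
    {
        // ...search this file for the literal here...
    }
});

foreach (string[] batch in Directory
    .EnumerateFiles(root, "*.cs", SearchOption.AllDirectories)
    .Chunk(100))
{
    processBatch.Post(batch);
}

processBatch.Complete();
await processBatch.Completion;
```

Alternatively, a BatchBlock<string>(100) can be linked in front of the worker block to do the same grouping inside the pipeline itself.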
Having said all that, my assumption is that your workload is disproportionately I/O-bound, which makes the capabilities of your CPU less relevant. Honestly, I wouldn't expect massive performance improvements even with the most sophisticated implementation.