Tags: multithreading, complexity-theory, computation-theory

Splitting a computational workload: where it is possible or impossible


I program, although I am not a computer scientist, so I would like to check whether I have correctly understood the challenge of splitting a workload. Is the reasoning below the right way to think about it?

Specifically, is the following statement (1) correct?

(1) If A(X_a) + A(X_b) + A(X_c) + ... = B(X_a, X_b, X_c, ...) = Y is an equation being computed,

then whether or not it can be computed more rapidly, from the computer's perspective, by assigning parts of the equation to individual threads running at the same time depends on the following:

If X_m changes when A(X_n) changes, for m not equal to n, then dividing the workload for that particular computation gives less of a performance gain; and if this is true for every combination of m and n in the system, then no performance gain from multithreading over single threading is possible.

Or in other words, do I understand correctly that the presence of linked variables reduces the ability to multithread successfully? Because X_b and X_c depend on A(X_a), it bottlenecks the process: the other threads know A but have to wait for the first thread to produce its output before they have instructions to execute. So simultaneous work on the parts of an instruction that is easily broken into parts cannot be done, and the computation takes as much time on one thread doing each part of the calculation one after another as it does on several threads working at once, with another thread summing the results in the order they complete.
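A minimal sketch of the two cases in Python (the function `A` and the inputs are invented for illustration): independent terms can be farmed out to a thread pool, while linked terms force a sequential chain, because each input only exists after the previous call returns.

```python
from concurrent.futures import ThreadPoolExecutor

def A(x):
    # hypothetical per-term workload
    return x * x

# Case 1 -- independent terms: every A(X_i) can run on its own thread.
inputs = [1, 2, 3, 4]
with ThreadPoolExecutor() as pool:
    terms = list(pool.map(A, inputs))
Y_parallel = sum(terms)  # B(...) is just a sum in this sketch

# Case 2 -- linked variables: the next input depends on A of the
# previous one, so no thread can start before the prior result exists.
x = 2
chain = []
for _ in range(4):
    x = A(x)          # next input only exists after this returns
    chain.append(x)
```

Note that the second loop cannot be parallelized at all: its steps form exactly the dependency chain the question describes.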

(2) Or is there a way around this bottleneck? For example, if the bottleneck is known in advance, the first thread can start early and store in memory the results of A(X_n) for every n that bottlenecks the operation, and then split the workload efficiently, one A(X_i) to the i-th thread. But to do this, the first thread would have to predict in some way when the calculation B(X_a, X_b, X_c, ...) will be executed BEFORE B(X_a, X_b, X_c, ...) is actually executed; otherwise it would run into the bottleneck.
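As a sketch of that idea (names and workloads are hypothetical), the anticipation step can run on its own thread ahead of time and fill a table that B later reads from:

```python
import threading

def A(x):
    return x * x  # stands in for the bottleneck computation

precomputed = {}

def anticipate(inputs):
    # runs early, before B is actually requested
    for x in inputs:
        precomputed[x] = A(x)

# the anticipation work happens ahead of time on its own thread
warmup = threading.Thread(target=anticipate, args=([1, 2, 3],))
warmup.start()
warmup.join()

def B(*xs):
    # by request time the bottleneck terms are already in memory;
    # fall back to computing A on a cache miss
    return sum(precomputed.get(x, A(x)) for x in xs)

result = B(1, 2, 3)
```

The hard part, exactly as the question says, is knowing which inputs to pass to `anticipate` before `B` is requested.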

[EDIT: To clarify, in the context of NWP's answer. If the clarification is too long / unclear, please leave a comment and I'll make a few graphics in LaTeX to shorten the question writeup.]

Suppose the longest path in the program "compute I" is 5 units of time, as in the example. If this longest path is known, and the running system can anticipate (based on past frequency of execution) when "compute I" will be run in the future, then the subprogram "compute B->E" (which depends on nothing else but is a proper subset of the longest path of "compute I") may be executed in advance, and its result stored in memory before the user requests "compute I".

If so, is the maximum speedup now considered to be 9/4, since B->E is ready and the other threads do not have to wait for it? Or is the maximum speedup for "compute I" still considered to be 9/5?

The anticipation program run beforehand has a cost, but this cost may be spread over the executions of "compute I". If the anticipation program has 15 steps, "compute I" is typically run 100 times per execution of the anticipation program, and all steps cost equally, do we simply say the maximum speedup possible for "compute I" is therefore 9/(5 - 1 + 15/100)?
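Taking the question's amortization at face value (these figures are the question's own assumptions, not an established result), the arithmetic comes out at roughly 2.17x:

```python
# one unit of the 5-unit critical path is precomputed, and the
# 15-step anticipation program is amortized over 100 runs
work = 9
critical_path = 5 - 1 + 15 / 100   # 4.15 units
speedup = work / critical_path
```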

The achievable speedup now appears to depend not only on the number of threads and the longest path, but also on the memory available to store precalculations, and on how far in advance another program can anticipate that "compute I" will be run and precalculate proper subprograms of it. Another program "compute X" may have a longest path of the same length as "compute I", yet the system may not be able to anticipate "compute X" as far in advance as "compute I". How do we weight the speedup achieved (i) at the expense of more memory to store precalculations, and (ii) by the fact that the execution of some programs can be anticipated further in advance than that of others, allowing their bottlenecks to be precalculated and the longest path cut down?

But if the longest path can be cut down dynamically, by improving the predictive precalculation of subprograms and by adding memory for storing their results, can bottlenecks be considered at all as determining the ultimate upper bound on speedup from splitting a computational workload?

From the linked-variables dependency perspective / graph perspective, the ultimate upper bound on speedup from multithreading a program "compute I" appears to be determined by its longest subprogram (the other subprograms depend on it and wait for it). But from the dynamics perspective, where the whole system is running before and after "compute I" is executed as a part of it, sufficient predictability of the timing of future executions of "compute I", together with the ability to store more and more precalculations of its independent subprograms, can cut the length of every subprogram of "compute I" down to 1 unit, meaning it could possibly achieve a speedup of 9/1 = 9 if sufficient predictability and memory are available.

Which perspective is the correct one for estimating the upper bound on speedup from multithreading? (A program run in a system that has been running a long time with sufficient memory seems to have no limit to multithreading speedup, whereas looked at by itself it has a very definite, fixed limit.)

Or is the question of cutting down the longest path by anticipation and partial precalculation moot, because the speedup in that case depends on whether the user happens to execute the program in a predictable way? If so, the upper bound on multithreading speedup due to anticipation cannot be known to a program writer or system designer, and should be ignored / not relied upon to exist.


Solution

  • I do not quite understand which things depend on what from your description, but I can give you some theory. There is Amdahl's law, which gives you an upper bound on the speedup you can achieve, based on how parallelizable a given algorithm is and assuming you have enough processors. If you can parallelize 50% of the calculation you can get a maximum speedup of 2x; 95% parallelization gives you a maximum speedup of 20x. To figure out how much speedup you can get, you need to know how much of your problem can be parallelized. This can be done by drawing a graph of the things you need to do and which of them depend on what, and then finding the longest path. Example:

    [flowchart: dependency graph of the nine blocks A through I; the longest dependency chain is B -> E -> F -> H -> I]

    In this example the longest path would be B->E->F->H->I. All blocks are assumed to take the same time to execute. There are 9 blocks and the longest path is 5 blocks, so your maximum achievable speedup is 9/5 = 1.8x. In practice you need to consider that your computer can only run a limited number of threads in parallel, that some blocks take longer than others, and that there is a cost involved in creating threads and in the appropriate locking mechanisms to prevent data races. These can be added to the graph by giving each block a cost and finding the longest path by total cost, including the cost of the threading mechanisms. Although this method only gives you an upper bound, it tends to be very humbling. I hope this allows you to draw a graph and find the answer.
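This longest-path bound can be computed mechanically. A sketch in Python, with edges invented so that the critical path matches the one named above (the real graph in the flowchart may differ):

```python
# hypothetical dependency graph: node -> prerequisites; every block
# costs 1 unit.  The edges are made up for illustration -- what matters
# is that the longest chain is B -> E -> F -> H -> I.
deps = {
    "A": [], "B": [], "C": [], "D": [],
    "E": ["B"], "F": ["A", "E"], "G": ["C"],
    "H": ["F", "G"], "I": ["D", "H"],
}

def longest_path(deps):
    memo = {}
    def depth(node):
        # length (in blocks) of the longest chain ending at this node
        if node not in memo:
            memo[node] = 1 + max((depth(p) for p in deps[node]), default=0)
        return memo[node]
    return max(depth(n) for n in deps)

total_work = len(deps)           # 9 blocks of 1 unit each
span = longest_path(deps)        # 5 blocks on the critical path
max_speedup = total_work / span  # 9 / 5 = 1.8
```

With per-block costs, `depth` would add the block's cost instead of 1; the structure of the computation stays the same.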

    EDIT: I forgot to say that Amdahl's law compares executing the code with a single thread to executing the code with an infinite number of threads and no overhead. If you make the multithreaded version execute different code than the single-threaded version, you are no longer bound by Amdahl's law.
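For reference, Amdahl's law in formula form, reproducing the 2x and 20x figures quoted above:

```python
def amdahl(parallel_fraction, n_threads=float("inf")):
    # the serial part stays serial; the parallel part is divided
    # among the threads
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_threads)

# with unlimited threads, only the serial fraction limits speedup
two_x = amdahl(0.50)            # 50% parallelizable -> 2x
twenty_x = round(amdahl(0.95))  # 95% parallelizable -> 20x
```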

    With enough memory and time you can calculate the results for all possible inputs and then just do a lookup based on a given input to find the result. Such a system would get a higher speedup because it does not actually calculate anything, and it is not bound by Amdahl's law. If you manage to optimize B->E to take zero units of time, the longest path becomes 3 and there are only 8 nodes, giving you a maximum speedup of 8/3 ≈ 2.67x, which is better than the 1.8x of before. That is only the speedup achievable by multithreading, though: in actual time the first version takes 4 time units and the second version 3. Optimizing code can give you more speedup than multithreading. The graph can still be useful, though. Assuming you do not run out of cores, it tells you which parts of your program are worth optimizing and which are not; assuming you do run out of cores, it tells you which paths should be prioritized. In my example I calculate A, B, C and D simultaneously and therefore need a quad-core to make it work. If I move C down in time to execute in parallel to E, and make D run parallel to H, a dual-core suffices for the same 1.8x speedup.
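The lookup idea in miniature (a hypothetical function and a deliberately tiny input domain): the whole calculation is paid for once, ahead of time, and each later request is a constant-time table read rather than a computation.

```python
def f(x):
    # stands in for an expensive calculation
    return x * x * x

# precompute every possible input once, up front
domain = range(256)
table = {x: f(x) for x in domain}

def fast_f(x):
    # O(1) at request time: no calculation happens here at all
    return table[x]

result = fast_f(12)
```

This only works when the input domain is small enough to enumerate and store, which is the memory trade-off the question asks about.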