Tags: parallel-processing, multiprocessing, distributed-computing, parallelism-amdahl

Calculating the maximum speedup with parallelization


Amdahl's Law lets us calculate the maximum theoretical speedup of a programme when adding more and more processing capacity to our hardware. It is stated as
S = 1 / ((1 - P) + (P / N)), where (1 - P) is the part of the programme that is sequential and (P / N) is the part which can benefit from N processors. Now, what Amdahl's Law leaves out is the factor of overhead. To count that in, we can say
S = 1 / ((1 - P) + O(N) + (P / N)), where O(N) represents the synchronisation effort that increases with the increasing number of computing nodes.
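
As a sketch, the two formulas can be put side by side in code. The linear model k * N used for the overhead term is a hypothetical assumption only, since the question merely says the synchronisation effort grows with N:

```python
# Plain Amdahl's Law vs. an overhead-extended variant.
# P .... parallel fraction of the programme
# N .... number of processors
# k .... hypothetical per-node synchronisation cost, so O(N) = k * N

def amdahl(P, N):
    return 1.0 / ((1.0 - P) + P / N)

def amdahl_with_overhead(P, N, k=0.01):
    return 1.0 / ((1.0 - P) + k * N + P / N)

for N in (1, 2, 4, 8, 16, 64):
    print(N, round(amdahl(0.75, N), 2), round(amdahl_with_overhead(0.75, N), 2))
```

With any k > 0 the second curve stops rising at some finite N, which is exactly what the plain formula cannot show.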

Now my question is: how can I calculate the maximum speedup of a programme while keeping O(N) in mind? Let's say we have a programme whose sequential part is 25% and whose parallel part is 75%.


Solution

  • Let's start with a quote:

    what Amdahl's law leaves out, is the factor of overhead.

    While I support the contemporary criticism of using an overhead-naive formula, I have to first object: Dr. Gene Amdahl's argument, published in 1967, holds up until today, as his seminal work was related to the realm of process-execution in general, not to the derived fields of pure-[SERIAL], just-[CONCURRENT] or true-[PARALLEL] execution, nor to the code-execution tricks that accompany scheduling in the context of CPU/GPU/MMC/FPGA-device programming.

    Given these add-on costs, originally pure-[SERIAL] code gets extended by additional blocks of new, special, meta-#decorated or direct-syntax elements, which get interpreted and may get compiled into new sections of machine-dependent machine-code.

    These new parts typically introduce O(1) add-on costs in the code itself, whereas executing them does not come at constant TimeDOMAIN and SpaceDOMAIN costs: those are typically dominantly polynomial with respect to SPACE, and spend more TIME as an impact of the limited resources for moving the data needed by such spawned/distributed process-related operations (parameter transfers + results returned).

    Once you introduce costs-of-synchronisation, the so-far true-[PARALLEL] process-execution stops being truly [PARALLEL]. It returns from the territory of fully independent [PARALLEL] process-scheduling into a pure-[SERIAL] ordering of somewhat semi-[PARALLEL] (if not just-[CONCURRENT]) sections of code-execution, subordinated to some kind of externally enforced policy (be it a transparent sync-ing, an execution orchestration governed by resources-pool limitations, etc.).

    Q : How can I calculate the maximum speedup of a programme while keeping O(N) in mind?

    Re-read the sections about add-on overheads and atomicity-of-work, and rather use the overhead-strict and resources-aware re-formulation of the original Amdahl's Law.

    pSO:= [PAR]-Setup-Overhead     add-on
    pTO:= [PAR]-Terminate-Overhead add-on
    
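    A minimal sketch of how these two add-on terms enter the denominator, assuming pSO and pTO are expressed as fractions of the original single-process run-time:

    ```python
    def overhead_strict_speedup(s, N, pSO, pTO):
        # s ..... serial fraction; (1 - s) is the parallelisable remainder
        # pSO/pTO stay serial: paid once, on top of Amdahl's classical denominator
        return 1.0 / (s + pSO + pTO + (1.0 - s) / N)
    ```

    Even with infinitely many CPUs, the ceiling drops from 1 / s down to 1 / (s + pSO + pTO).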

    For small-scale problems, the lump sum of costs can easily exceed the benefits coming from ( ( 1 - s ) / N ): the pSPACE + pTIME add-on costs of parameter transfers (which scale with the amount of data-xfers), plus the initial pSO + pTO, being most of the time the cTIME + cSPACE add-on costs of instantiating new process(es) inside the localhost O/S, or across a (potentially heterogeneous) configuration, so as to meet and welcome the new process, execute it and terminate it once finished. Such cases will never gain a bit of speedup, but resort into paying more in add-on costs than can ever get justified by the Amdahl's argument. That is never a fault of Dr. Amdahl, is it?
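    How large pSO + pTO actually are is best benchmarked, not guessed. A minimal sketch (the tiny_job name and its near-zero payload are illustrative only):

    ```python
    import time
    from multiprocessing import Process

    def tiny_job():
        pass  # near-zero work, so the timing below shows almost pure overhead

    if __name__ == "__main__":
        t0 = time.perf_counter()
        tiny_job()
        in_process = time.perf_counter() - t0

        t0 = time.perf_counter()      # pSO + pTO: spawn, schedule, join, reap
        p = Process(target=tiny_job)
        p.start()
        p.join()
        spawned = time.perf_counter() - t0

        print(f"in-process: {in_process:.6f} s   spawned: {spawned:.6f} s")
    ```

    The spawned call costs orders of magnitude more than the in-process call; that gap is the minimum amount of useful work each distributed task must amortise before any speedup can appear.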

    For computing tasks that are large and computing-intensive, the pSO + pTO can get justified by the split, and the [PARALLEL] efforts may reach some pleasant speedups (depending a lot on s, which determines the remainder, the "parallel portion" ( 1 - s )).


    In these large & dense cases, the result is not much devastated by the principally variable, yet relatively negligible, cTIME + cSPACE costs of the data-related transfers of initial parameters + returned values.

    With ( 1 - s ) = 75%, the speedup is principally less than 4 ( = 1 / 0.25 ), even for infinitely many CPUs.
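    The same can be verified numerically. The 4x ceiling holds for s = 0.25, and once any growing overhead term is added (the linear 0.002 * N below is an illustrative assumption only), the speedup peaks at a finite N and degrades afterwards:

    ```python
    def speedup(s, N, k=0.0):
        # k * N stands in for the O(N) synchronisation term from the question
        return 1.0 / (s + k * N + (1.0 - s) / N)

    s = 0.25
    print(speedup(s, 10**6))      # just below the 1 / s = 4x ceiling

    best_N = max(range(1, 1025), key=lambda N: speedup(s, N, k=0.002))
    print(best_N, speedup(s, best_N, k=0.002))
    ```

    Past best_N, adding CPUs makes the run slower, not faster: exactly the behaviour the overhead-naive formula can never predict.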

    With all add-on overheads best benchmarked, and given that some known processing granularity with atomicity-of-work can also be identified, you will get net-speedup expectations close to the resulting processing, provided all due steps for non-blocking execution and for avoiding race-conditions and O/S restrictions were put in place in due form and fashion.