Tags: haskell, parallel-processing, parallelism, amdahl

How non-trivial should a computation be to make it reasonable to spark it for parallel execution in Haskell?


According to the documentation for Control.Parallel, one should make sure that the computation being sparked is non-trivial, so that creating the spark is cheaper than the computation itself.

This makes sense, but after listening to Bartosz Milewski talk about how cheap sparks are, I'm wondering how experienced Haskell programmers determine whether or not a computation is worthy of parallelism.


Solution

  • This subject is based on facts, not opinions.

    Please take note of a few facts about the actual overhead costs before reading on:

    ".. creating spark doesn't immediately wakeup idle capability, see here. By default scheduling interval is 20ms, so when you create a spark, it will take up to 20 ms to turn it to a real thread. By that time the calling thread most likely will already evaluate the thunk, and the spark will be either GC'd or fizzled.

    By contrast, forkIO will immediately wakeup idle capability if any. That is why explicit concurrency is more reliable then parallel strategies."

    So, remember to add up to +20 ms, and/or the benchmarked cost of a forkIO-spawned functional block, to the add-on overheads cited below in the realistically achievable cost/benefit ( speedup ) formulae.
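
    A minimal sketch of that contrast ( an illustration, not this answer's own benchmark: it assumes GHC with the parallel package, compiled with -threaded and run with +RTS -N2 or more; the payload sizes are arbitrary ):

     import Control.Parallel        ( par )
     import Control.Concurrent      ( forkIO )
     import Control.Concurrent.MVar ( newEmptyMVar, putMVar, takeMVar )
     import Control.Exception      ( evaluate )

     main :: IO ()
     main = do
          -- a spark is merely queued into the spark pool; an idle capability
          -- may convert it only later ( or never, if it fizzles meanwhile )
          let x = sum [1 .. 10000000] :: Int
          _ <- evaluate ( x `par` () )
          -- forkIO, by contrast, wakes an idle capability immediately
          done <- newEmptyMVar
          _    <- forkIO ( evaluate ( sum [1 .. 10000000 :: Int] ) >> putMVar done () )
          takeMVar done
          print x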


    This problem was solved by Dr. Gene AMDAHL many decades ago

    More recent generations of C/S students and practitioners just seem to have somehow forgotten the elementary process-scheduling logic ( i.e. the rules, not art, of properly organising the flow of code-execution over the system's restricted physical resources ).

    Though a fair objection may and will come from the nature of functional languages, where the lambda-calculus can and often does harness a space, otherwise hidden from imperative languages, for going into smart, fine-grain parallelism, derived right from the laws of the lambda- or pi-calculi.

    Yet the core message holds, and has done so for more than 60 years.


    A piece of quantitative, fair, records-of-evidence-based rationale is quite enough on this ( no magic, no hidden Art of whatever nature ):

    Please first do your own best to fully understand the original formulation of Amdahl's Law, and kindly also revise the recent criticism and the overhead-strict, resources-aware re-formulation of that original, generally valid, universal system-scheduling law.

    The additions in the [ Criticism ] section, sketched below, were meant exactly to match what actually happens when someone comes to a piece of code with an idea to "re-organise" the computing graph and to enter into the process-flow some either "just"-[CONCURRENT] or true-[PARALLEL] computing syntax-constructors ( whatever the actual code-execution tools are ).
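
    For reference, a common way to sketch that overhead-strict re-formulation follows ( the symbols are introduced here for illustration: p is the parallelisable fraction, N the processor count, and oSETUP / oJOIN the add-on overheads, each normalised to the original run-time ):

     % the original Amdahl's Law
     S = \frac{1}{ ( 1 - p ) + \frac{p}{N} }

     % the overhead-strict re-formulation -- the add-on overheads only ever
     % grow the denominator, i.e. they only ever shrink the achievable speedup
     S = \frac{1}{ ( 1 - p ) + o_{SETUP} + o_{JOIN} + \frac{p}{N} }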


    Having got the overhead-strict Amdahl's Law theory in hand, let's measure:

    This part is easy and systematic: my students often rant about it, but going forward each of them collects a set of hands-on experience of what it actually takes ( in costs ) to go into any form of parallelised code-execution.

    1 ) Create a NOP-function - a function that indeed does nothing, except being run ( without an obligation to pass any arguments, the less any remarkable in volume, without trying to allocate a single bit of memory during its ( empty )-code-execution, and without returning any value "back" ). This is an idealised NOP-function payload, to let it be spawned / sparked / distributed into a parallelism-tool-of-choice-driven execution.

    Having the NOP-fun ready, let's benchmark the pure overhead of such NOP-fun code-execution in multiple instances and measure the time it took ( a runnable sketch follows the timing template below ).

    Being sure all such instances were doing indeed nothing "there", the lump sum of time spent between the two time-lines is -- hooray -- the pure overhead cost of going parallelised, plus the process re-collection overhead cost.

    So simple, so easy.

    Different tools differ in how much cost a user-programme will accrue, but both the metric and the methodology are crystal-clear.

     CuT_start <- liftIO $ getCurrentTime  -- a Code_Under_Test___START
                                           -- I use a ZeroMQ Stopwatch() instance
                                           -- having a better than [us] resolution
                                           -- but haven't found it in Haskell binding
     --CuT_TIMING_CRITICAL_SECTION_/\/\/\/\/\/\\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
     <_CuT_a_Code_syntax-constructor_Under_Test_>
     --CuT_TIMING_CRITICAL_SECTION_/\/\/\/\/\/\\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\
    
     CuT_end <- liftIO $ getCurrentTime    -- a Code_Under_Test___END
    
     liftIO $ print ( diffUTCTime CuT_end CuT_start )
    
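    For step 1 ), a self-contained sketch of such a NOP-spark measurement might look as follows ( assumptions: the parallel and time packages, compiled with ghc -O2 -threaded and run with +RTS -N; the count of 1,000,000 sparks is purely illustrative ):

     import Control.Parallel  ( par )
     import Control.Exception ( evaluate )
     import Data.Time.Clock   ( getCurrentTime, diffUTCTime )

     -- an idealised NOP-payload: no remarkable arguments, no allocation,
     -- no value passed "back"
     nop :: Int -> ()
     nop _ = ()

     main :: IO ()
     main = do
          t0 <- getCurrentTime
          -- queue 1,000,000 sparks, each carrying a NOP-payload; most will
          -- fizzle or be GC'd, which is the point -- what remains measurable
          -- is the pure spark-creation overhead
          _  <- evaluate ( foldr (\i acc -> nop i `par` acc) () [1 .. 1000000 :: Int] )
          t1 <- getCurrentTime
          print ( diffUTCTime t1 t0 )   -- the lump-sum overhead cost

    Running the same binary with +RTS -N1 -s and again with +RTS -N4 -s also makes the RTS print its SPARKS statistics ( converted / overflowed / dud / GC'd / fizzled ), showing what actually happened to the sparked NOP-payloads.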

    2 ) Having registered the net cost of spawning / sparking the intended amount of jobs, one may move forward with ( see the payload-scaling sketch after this list ):
    - adding "remote" memory allocations ( literally until the swapping kills the O/S )
    - adding "remote" CPU-bound processing ( again, as far as one smells the fatigue of the O/S kernel's scheduling efforts to create some still feasible hardware-thread mapping )
    - adding "remote" process testing as per the call-interface scaling ( the volume of data that needs to pass from caller to callee ) dependencies
    - adding "remote" process return value(s)
    - adding "remote" process needs to access some shared resource


    The final decision: Go / No-Go

    All these collected and recorded add-on costs just increase the real-world code-execution overhead costs that have to be entered into the recent, re-formulated Amdahl's Law.

    If and only if
    the overhead-strict, resources-aware speedup result is >> 1.00 does it make sense to go into parallelised code-execution.

    In all cases where
    the "improved" speedup is actually <= 1.00, it would indeed be a very bad idea to pay more than one receives from such an "improvement".

    ( A reversed formulation is always possible -- derive the minimum amount of processing that will at least justify the above systematically benchmarked costs of using a respective type of parallelised-code syntax-constructor. )
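
    The Go / No-Go test itself then reduces to a one-liner; a sketch using the symbols from the re-formulation above ( all the names here are this answer's own illustrations ):

     -- overhead-strict speedup: p = parallelisable fraction, oSetup / oJoin
     -- = the benchmarked add-on overheads ( normalised to the original
     -- run-time ), n = number of processors
     speedup :: Double -> Double -> Double -> Int -> Double
     speedup p oSetup oJoin n =
          1 / ( ( 1 - p ) + oSetup + oJoin + p / fromIntegral n )

     goNoGo :: Double -> Double -> Double -> Int -> Bool
     goNoGo p oSetup oJoin n = speedup p oSetup oJoin n > 1.0

    E.g. speedup 0.95 0.02 0.02 8 ~ 4.8 says Go, while speedup 0.10 0.20 0.20 8 ~ 0.76 says No-Go, no matter how cheap each individual spark may have looked.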

    Q.E.D.