
Unexpected performance observed in repa-algorithms function


I was testing out the mmultP function from repa-algorithms-3.2.1.1 with the following code (a tad condensed here for brevity):

import Data.Array.Repa hiding            (map)
import Data.Array.Repa.Algorithms.Matrix (mmultP)

import Control.Monad                     (replicateM)
import Control.Arrow                     ((&&&))
import System.Random.MWC                 (initialize, uniformR)
import Control.Monad.ST                  (runST)
import Data.Vector.Unboxed               (singleton)
import Data.Word                         (Word32)

-- Create a couple of dense matrices
genRnds :: Word32 -> [Double]
genRnds seed = runST $ do
    gen <- initialize (singleton seed)
    replicateM (1000 ^ 2) (uniformR (0, 1) gen)

(arr, brr) = head &&& last $ map (fromListUnboxed (Z :. 1000 :. 1000 :: DIM2) . genRnds) [1, 100000]

-- mmultP test
main :: IO ()
main = mmultP arr brr >>= print

and, following the flags recommended in the repa documentation, compiled it with

ghc mmultTest.hs -Odph -rtsopts -threaded -fno-liberate-case -funfolding-use-threshold1000 -funfolding-keeness-factor1000 -fllvm -optlo-O3 -fforce-recomp

Here's the sequential run in the threaded runtime:

$ time ./mmultTest +RTS -K100M > /dev/null
real    0m10.962s
user    0m10.790s
sys     0m0.161s

and here's one using 4 cores (running on a four-core MacBook Air):

$ time ./mmultTest +RTS -N4 -K100M > /dev/null
real    0m13.008s
user    0m18.591s
sys     0m2.067s

Does anyone have any intuition as to what's happening here? I also see slower-than-sequential performance for -N2 and -N3; each additional core seems to add wall-clock time.

Note that I do observe some minor gains from parallelism on some hand-rolled Repa matrix multiply code.
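(For concreteness, a hand-rolled multiply in the style I mean looks roughly like the sketch below; mmultHand and transpose2P are illustrative local names, not part of repa-algorithms. The idea is to transpose the second matrix once so both traversals are row-major, then let the outer computeP parallelise over output elements while each dot product runs sequentially via sumAllS.)

import qualified Data.Array.Repa as R
import Data.Array.Repa (Array, U, DIM2, Z(..), (:.)(..), All(..), Any(..))

-- Transpose the second operand once so both dot-product traversals
-- below walk memory row-major.
transpose2P :: Monad m => Array U DIM2 Double -> m (Array U DIM2 Double)
transpose2P = R.computeP . R.transpose

-- The outer computeP parallelises over output elements; each element
-- is a sequential dot product via sumAllS.
mmultHand :: Monad m
          => Array U DIM2 Double
          -> Array U DIM2 Double
          -> m (Array U DIM2 Double)
mmultHand arr brr = do
    trr <- transpose2P brr
    let (Z :. h :. _) = R.extent arr
        (Z :. _ :. w) = R.extent brr
    R.computeP $ R.fromFunction (Z :. h :. w) $ \(Z :. i :. j) ->
        R.sumAllS $ R.zipWith (*)
            (R.slice arr (Any :. i :. All))
            (R.slice trr (Any :. j :. All))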

UPDATE:

Puzzling; I replaced main with

mmultBench :: IO ()
mmultBench = do
    results <- mmultP arr brr
    let reduced = sumAllS results
    print reduced

and removed the dependency on mwc-random:

(arr, brr) = head &&& last $ map (fromListUnboxed (Z :. 1000 :. 1000 :: DIM2)) (replicate 2 [1..1000000])
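For reference, the benchmark harness is nothing special; a minimal sketch along these lines, assuming a criterion version that exports nfIO:

import Criterion.Main (bench, defaultMain, nfIO)

main :: IO ()
main = defaultMain
    [ bench "mmultP 1000x1000" (nfIO mmultBench) ]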

A Criterion benchmark with the runtime options -N1 -K100M yields:

mean: 1.361450 s, lb 1.360514 s, ub 1.362915 s, ci 0.950
std dev: 5.914850 ms, lb 3.870615 ms, ub 9.183472 ms, ci 0.950

and -N4 -K100M gives me:

mean: 556.8201 ms, lb 547.5370 ms, ub 573.5012 ms, ci 0.950
std dev: 61.82764 ms, lb 40.15479 ms, ub 102.5329 ms, ci 0.950

That's a lovely speedup, roughly 2.4x. I would almost conclude that the earlier behaviour was due to writing the resulting 1000x1000 array to stdout, but as I mentioned, I do see parallelism gains in that same setup if I swap in my own matrix multiply code. Still scratching my head.


Solution

1) Printing the matrix to stdout makes the program IO-bound. Any speedup figures recorded in this situation will be lies.

2) There are no 4-core MacBook Airs. They are all 2-core, with 2 hyper-threads per core. Only 2 threads can actually run at a time; any speedup beyond -N2 is due to latency hiding, where the second hyper-thread on a core runs while the first is stalled on a cache miss.
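If in doubt about what the machine exposes, ask the RTS; a small check using GHC.Conc (getNumProcessors counts logical CPUs, i.e. hyper-threads, so a 2-core Air reports 4):

import GHC.Conc (getNumCapabilities, getNumProcessors)

main :: IO ()
main = do
    caps <- getNumCapabilities   -- the current -N setting
    cpus <- getNumProcessors     -- logical CPUs, counting hyper-threads
    putStrLn $ "capabilities: " ++ show caps ++ ", logical CPUs: " ++ show cpus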