Tags: performance, haskell, concurrency, parallel-processing, fold

Is runInBoundThread the best tool for parallelism?


Say I want to fold monoids in parallel. My computer has 8 cores. I have this function to split a list into equal-sized smaller lists (with bounded modulo bias):

import Data.List

parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs = let
    (us,vs) = splitAt (quot (length xs) n) xs
    in us : parallelize (n-1) vs
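To make the chunking concrete, here is a standalone check (the definition is repeated so the snippet compiles on its own):

```haskell
parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs = let
    (us, vs) = splitAt (quot (length xs) n) xs
    in us : parallelize (n - 1) vs

-- The leftover elements from the integer division land in the last chunk:
example :: [[Int]]
example = parallelize 3 [1 .. 7]  -- [[1,2],[3,4],[5,6,7]]
```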

The first version of parallel fold I made was:

import Control.Concurrent
import Control.Concurrent.QSemN
import Data.Foldable
import Data.IORef

foldP :: Monoid m => [m] -> IO m
foldP xs = do
    result <- newIORef mempty
    sem <- newQSemN 0
    n <- getNumCapabilities
    let yss = parallelize n xs
    for_ yss (\ys -> forkIO (modifyIORef result (fold ys <>) >> signalQSemN sem 1))
    waitQSemN sem n
    readIORef result
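As an aside, modifyIORef is neither atomic nor strict, so concurrent workers can lose updates, and the folds are deferred as unevaluated thunks. A minimal sketch of the same structure that avoids both issues (foldP' is a hypothetical name; like the original, it combines chunks in scheduling order, so it is only deterministic for commutative monoids):

```haskell
import Control.Concurrent
import Control.Concurrent.QSemN
import Control.Exception (evaluate)
import Data.Foldable (fold, for_)
import Data.IORef
import Data.Monoid (Sum (..)) -- used in the usage example

parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs = let
    (us, vs) = splitAt (quot (length xs) n) xs
    in us : parallelize (n - 1) vs

-- Each worker forces its chunk's fold to WHNF before publishing,
-- and the shared accumulator is updated atomically.
foldP' :: Monoid m => [m] -> IO m
foldP' xs = do
  result <- newIORef mempty
  sem <- newQSemN 0
  n <- getNumCapabilities
  for_ (parallelize n xs) $ \ys ->
    forkIO $ do
      s <- evaluate (fold ys)  -- force the chunk in this thread
      atomicModifyIORef' result (\acc -> (acc <> s, ()))
      signalQSemN sem 1
  waitQSemN sem n
  readIORef result
```

For example, getSum <$> foldP' (map Sum [1 .. 100]) gives 5050.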

But the use of IORefs and semaphores seemed ugly to me, so I made another version:

import Data.Traversable

foldP :: Monoid m => [m] -> IO m
foldP xs = do
    n <- getNumCapabilities
    let yss = parallelize n xs
    rs <- for yss (\ys -> runInUnboundThread (return (fold ys)))
    return (fold rs)
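Note that `for` runs its actions one after another, and runInUnboundThread waits for its action to finish before returning, so nothing here actually overlaps; moreover, return (fold ys) builds a lazy thunk rather than doing the work in the forked thread. A base-only sketch that really folds the chunks concurrently could look like this (foldP2 is a hypothetical name; results are collected in chunk order, so the monoid need not be commutative):

```haskell
import Control.Concurrent
import Control.Exception (evaluate)
import Data.Foldable (fold)
import Data.Monoid (Product (..)) -- used in the usage example
import Data.Traversable (for)

parallelize :: Int -> [a] -> [[a]]
parallelize 0 _ = []
parallelize n [] = replicate n []
parallelize n xs = let
    (us, vs) = splitAt (quot (length xs) n) xs
    in us : parallelize (n - 1) vs

-- Fork one thread per chunk, force the chunk's fold inside that
-- thread, then combine the per-chunk results in order.
foldP2 :: Monoid m => [m] -> IO m
foldP2 xs = do
  n <- getNumCapabilities
  vars <- for (parallelize n xs) $ \ys -> do
    v <- newEmptyMVar
    _ <- forkIO (evaluate (fold ys) >>= putMVar v)  -- runs concurrently
    pure v
  fold <$> traverse takeMVar vars
```

For example, getProduct <$> foldP2 (map Product [1 .. 10]) gives 3628800.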

The test code I used is:

import Data.Monoid
import System.CPUTime

main :: IO ()
main = do
    start <- getCPUTime
    Product result <- foldP (fmap Product [1 .. 100])
    end <- getCPUTime
    putStrLn ("Time taken: " ++ show (end - start) ++ "ps.")
    putStrLn ("Result: " ++ show result)
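One caveat about this benchmark: getCPUTime sums CPU time across all cores, so a parallel run can report more CPU time even when it finishes sooner. For measuring speedup, monotonic wall-clock time is the better yardstick; a small helper (timed is a hypothetical name, assuming base >= 4.11 for GHC.Clock) might look like:

```haskell
import GHC.Clock (getMonotonicTime)  -- base >= 4.11

-- Wall-clock timing of an IO action, in seconds.
timed :: IO a -> IO (Double, a)
timed act = do
  start <- getMonotonicTime
  r <- act
  end <- getMonotonicTime
  pure (end - start, r)
```

Usage would be e.g. (t, Product result) <- timed (foldP (fmap Product [1 .. 100])).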

The second version of foldP outperformed the first version. When I used runInBoundThread instead of runInUnboundThread, it became even faster.

What causes these performance differences?


Solution

  • TL;DR: Use the fold function from the massiv package and you will likely get the most efficient solution in Haskell.

    I would like to start by saying that the first thing people forget when implementing concurrency patterns like this is exception handling. In the solution from the question, exception handling is non-existent, so the solution is simply wrong. I would therefore recommend using existing implementations of common concurrency patterns. async is the go-to library for concurrency, but for this use case it will not be the most efficient solution.

    This particular example can easily be solved with the scheduler package; in fact, it is exactly the kind of thing it was designed for. Here is how you can use it to fold monoids:

    import Control.Scheduler
    import Control.Monad.IO.Unlift
    import Data.Foldable (fold)
    
    foldP :: (MonadUnliftIO m, Monoid n) => Comp -> [n] -> m n
    foldP comp xs = do
      rs <-
        withScheduler comp $ \scheduler ->
          mapM_ (scheduleWork scheduler . pure . fold) (parallelize (numWorkers scheduler) xs)
      pure $ fold rs
    

    See the Comp type for an explanation of the available parallelization strategies. In practice I have found that Par usually works best, because it uses pinned threads created with forkOn.

    Note that the parallelize function is implemented both inefficiently and dangerously: parallelize 0 xs silently discards every element, and length xs is recomputed on every recursive call. It is better to write it this way:

    
    parallelize :: Int -> [a] -> [[a]]
    parallelize n' xs' = go 0 id xs'
      where
        n = max 1 n'
        -- chunks of at least two elements are needed to benefit from a parallel fold
        k = max 2 $ quot (length xs') n
        go i acc xs
          | null xs = acc []
          | i < n =
            case splitAt k xs of
              (ls, rs) -> go (i + 1) (acc . (ls :)) rs
          | otherwise = acc . (xs:) $ []
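    A quick standalone check of this version (the definition is repeated so the snippet compiles on its own). Unlike the original, it never drops elements, since n is clamped to at least 1; the remainder can spill into one extra trailing chunk:

```haskell
parallelize :: Int -> [a] -> [[a]]
parallelize n' xs' = go 0 id xs'
  where
    n = max 1 n'
    k = max 2 $ quot (length xs') n
    go i acc xs
      | null xs = acc []
      | i < n =
        case splitAt k xs of
          (ls, rs) -> go (i + 1) (acc . (ls :)) rs
      | otherwise = acc . (xs :) $ []

-- The remainder spills into an extra trailing chunk here:
example :: [[Int]]
example = parallelize 3 [1 .. 7]  -- [[1,2],[3,4],[5,6],[7]]
```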
    

    One more bit of advice: a list is far from an ideal data structure for parallelization, or for efficiency in general. In order to split the list into chunks before parallelizing the computation, you already have to traverse the whole structure with parallelize, which could be avoided if you used an array. What I am getting at is: use an array instead, as suggested at the beginning of this answer.