haskell image-processing parallel-processing repa data-parallel-haskell

What are the key differences between the Repa 2 and 3 APIs?

To be more specific, I have the following innocuous-looking little Repa 3 program:

{-# LANGUAGE QuasiQuotes #-}

import Prelude hiding (map, zipWith)
import System.Environment (getArgs)
import Data.Word (Word8)
import Data.Array.Repa
import Data.Array.Repa.IO.DevIL
import Data.Array.Repa.Stencil
import Data.Array.Repa.Stencil.Dim2

main = do
  [s] <- getArgs
  img <- runIL $ readImage s

  let out = output x where RGB x = img
  runIL . writeImage "out.bmp" . Grey =<< computeP out

output img = map cast . blur . blur $ blur grey
  where
    grey              = traverse img to2D luminance
    cast n            = floor n :: Word8
    to2D (Z:.i:.j:._) = Z:.i:.j

---------------------------------------------------------------

luminance f (Z:.i:.j)   = 0.21*r + 0.71*g + 0.07*b :: Float
  where
    (r,g,b) = rgb (fromIntegral . f) i j

blur = map (/ 9) . convolve kernel
  where
    kernel = [stencil2| 1 1 1
                        1 1 1
                        1 1 1 |]

convolve = mapStencil2 BoundClamp

rgb f i j = (r,g,b)
  where
    r = f $ Z:.i:.j:.0
    g = f $ Z:.i:.j:.1
    b = f $ Z:.i:.j:.2

Which takes this much time to process a 640x420 image on my 2Ghz core 2 duo laptop:

real    2m32.572s
user    4m57.324s
sys     0m1.870s

I know something must be quite wrong, because I have gotten much better performance on much more complex algorithms using Repa 2. Under that API, the big improvement I found came from adding a call to 'force' before every array transform (which I understand to mean every call to map, convolve, traverse etc). I cannot quite make out the analogous thing to do in Repa 3 - in fact I thought the new manifestation type parameters are supposed to ensure there is no ambiguity about when an array needs to be forced? And how does the new monadic interface fit into this scheme? I have read the nice tutorial by Don S, but there are some key gaps between the Repa 2 and 3 APIs that are little discussed online AFAIK.

More simply, is there a minimally impactful way to fix the above program's efficiency?

Solution

The new representation type parameters don't automagically force when needed (it's probably a hard problem to do that well) - you still have to force manually. In Repa 3 this is done with the computeP function:

computeP
  :: (Monad m, Repr r2 e, Fill r1 r2 sh e)
  => Array r1 sh e -> m (Array r2 sh e)

I personally really don't understand why it's monadic, because you can just as well use Monad Identity:

import Control.Monad.Identity (runIdentity)
force
  :: (Repr r2 e, Fill r1 r2 sh e)
  => Array r1 sh e -> Array r2 sh e
force = runIdentity . computeP

So, now your output function can be rewritten with appropriate forcing:

output img = map cast . f . blur . f . blur . f . blur . f $ grey
  where ...

with an abbreviation f using a helper function u to aid type inference:

u :: Array U sh e -> Array U sh e
u = id
f = u . force

With these changes, the speedup is quite dramatic - which is to be expected, as without intermediate forcing each output pixel ends up evaluating much more than is necessary (the intermediate values aren't shared).

Your original code:

real    0m25.339s
user    1m35.354s
sys     0m1.760s

With forcing:

real    0m0.130s
user    0m0.320s
sys     0m0.028s

Tested with a 600x400 png, the output files were identical.