Parallel Programming using Accelerate (Data.Array.Accelerate)

Situation
I'm currently working on a project that does edge detection, and I'd like to run the algorithms on accelerated arrays to get better performance. Unfortunately, I'm pretty new to functional programming as well as parallel programming, and I don't really know what the right way to go is.

Problem
To convert a given image into grayscale or even to perform edge detection I need access to each of the pixels / values of the array.

Using non-accelerated arrays (the Data.Array package) I could use the (!)-operator to get the desired value.

Using accelerated arrays (the Data.Array.Accelerate package) there are similar functions like ..

(!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp ix -> Exp e
Description: Expression form that extracts a scalar from an array

(!!) :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp Int -> Exp e
Description: Expression form that extracts a scalar from an array at a linear index

.. but they always end up returning Accelerate's expression type (Exp e), which leads to my question ..

Question
Is it possible to 'unpack' the value from the Exp data-type or what else would you recommend me to do?

Example

Converting from image to accelerated array works ..

toArr :: Image PixelRGB8 -> Acc (Array DIM2 (Pixel8, Pixel8, Pixel8))
toArr img = use $ fromFunction (Z :. width :. height) (\(Z :. x :. y) -> let (PixelRGB8 r g b) = pixelAt img x y in (r, g, b))
            where width = imageWidth img
                  height = imageHeight img

.. but I don't know how to do the reverse, because I would need to access the expression's value to generate the image from width/height/pixels.

toJuicy :: Acc (Array DIM2 (Pixel8, Pixel8, Pixel8)) -> Image PixelRGB8 
toJuicy arr = undefined

Any help would be highly appreciated.

Solution

It's important to emphasize that Accelerate isn't just “normal parallelisation”: it's specifically SIMD parallelisation, which works best by far on a GPU. But you can't just read arbitrary values out of GPU memory, at least not without losing all the performance advantage, because that memory isn't optimised for random access and only works well in “batch mode”. Hence the library's functions that do the actual work always return an Acc / Exp value, so that intermediate results can stay on the GPU (or whatever other parallel processor is used).

Now, it is also possible to execute Accelerate code on the CPU, in which case this issue doesn't really arise. But even here the interface is kept consistent: you should carry the expensive calculation through to the end inside Acc, and only at the end bring the results back as “normal Haskell values”.
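As an illustration of staying inside Acc, here is a minimal sketch of the grayscale step from the question. The name toGray and the luminance weights are my own assumptions, not part of the question or of Accelerate; Pixel8 is just Word8, so the input type matches what toArr produces. Note that the pixel is never "unpacked" to a Haskell value: unlift only turns an Exp of a tuple into a tuple of Exps.

```haskell
import qualified Data.Array.Accelerate as A
import Data.Array.Accelerate (Acc, Array, DIM2, Exp)
import Data.Word (Word8)

-- Hypothetical helper (not from the question): grayscale conversion that
-- never leaves Acc, so its result can be fed into further Acc stages.
toGray :: Acc (Array DIM2 (Word8, Word8, Word8)) -> Acc (Array DIM2 Word8)
toGray = A.map $ \px ->
  let -- Split the Exp of a triple into a triple of Exps.
      (r, g, b) = A.unlift px :: (Exp Word8, Exp Word8, Exp Word8)
      -- Weighted sum computed in Float; the weights are an assumed choice.
      y = 0.299 * A.fromIntegral r
        + 0.587 * A.fromIntegral g
        + 0.114 * A.fromIntegral b :: Exp Float
  in A.round y
```

An edge-detection stage could then consume toGray (toArr img) directly, still as an Acc value, without any intermediate extraction.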

To accomplish this retrieval, each of the device-specific backends offers a run function, for instance Data.Array.Accelerate.LLVM.Native.run.
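A possible toJuicy along these lines, as a sketch rather than a definitive implementation: it assumes the accelerate-llvm-native and JuicyPixels packages, and it keeps the (width, height) index order used by toArr in the question. Once run has produced a plain Array, arrayShape and indexArray work on ordinary Haskell values, so no Exp is involved any more.

```haskell
{-# LANGUAGE TypeOperators #-}

import Codec.Picture (Image, Pixel8, PixelRGB8 (..), generateImage)
import Data.Array.Accelerate
  (Acc, Array, DIM2, Z (..), (:.) (..), arrayShape, indexArray)
import qualified Data.Array.Accelerate.LLVM.Native as CPU

toJuicy :: Acc (Array DIM2 (Pixel8, Pixel8, Pixel8)) -> Image PixelRGB8
toJuicy acc = generateImage pixel width height
  where
    -- Execute the whole Accelerate program once, on the CPU backend.
    arr = CPU.run acc
    -- Same index order as toArr in the question: Z :. width :. height.
    Z :. width :. height = arrayShape arr
    -- Plain, host-side lookup into the already computed result.
    pixel x y =
      let (r, g, b) = indexArray arr (Z :. x :. y)
      in PixelRGB8 r g b
```

Switching to another backend should only require changing the imported run. The important design point is that run is called once, on the finished pipeline, rather than once per pixel.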