Having done some work implementing various ML algorithms in Alea, I've tried benchmarking some simple but essential routines. It surprised me to find that Alea takes roughly 3x longer than the equivalent cuBLAS call to sgeam to do the same thing. If I were doing something more complex, like matrix multiplication where I had to juggle shared memory, this would have been understandable, but the following are just simple array transformations.
let dmat = createRandomUniformMatrix 100 1000 1.0f 0.0f
let dmat2 = createRandomUniformMatrix 100 1000 1.0f 0.0f
let rmat = createEmptyMatrixLike dmat
let m = new DeviceUnaryTransformModule<float32> <@ fun x -> x*2.0f @>
#time
//4.85s/100k
for i=1 to 100000 do
    m.Apply(dmat, rmat) |> ignore
#time
#time
//1.8s/100k
for i=1 to 100000 do
    sgeam2 nT nT 2.0f dmat 0.0f dmat2 rmat |> ignore
#time
The DeviceUnaryTransformModule's kernel is the same as in the basic transform example; the only difference is that instead of gathering the result to the host afterwards, it keeps the data on the device.
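For context, here is roughly what that module looks like, as a minimal non-generic float32 sketch (my real version is generic and its Apply works on my own device-matrix wrapper; here it takes DeviceMemory<float32> for brevity):

type DeviceUnaryTransformModule(target, op:Expr<float32 -> float32>) =
    inherit GPUModule(target)

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (n:int) (x:deviceptr<float32>) (y:deviceptr<float32>) =
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < n do
            y.[i] <- __eval(op) x.[i]
            i <- i + stride

    // instead of gathering the result to the host, keep it on the device
    member this.Apply(x:DeviceMemory<float32>, y:DeviceMemory<float32>) =
        let lp = LaunchParam(64, 256)
        this.GPULaunch <@ this.Kernel @> lp x.Length x.Ptr y.Ptr
        y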
Also, Unbound's reduce works really poorly for me, so badly in fact that there must be an error in the way I've been using it. It is roughly 20x slower than using sgeamv twice to sum a matrix.
let makeReduce (op:Expr<'T -> 'T -> 'T>) =
    let compileReductionKernel (op:Expr<'T -> 'T -> 'T>) =
        worker.LoadProgram(
            DeviceReduceImpl.DeviceReduce(op, worker.Device.Arch, PlatformUtil.Instance.ProcessBitness).Template
            )
    let prog = compileReductionKernel op
    let runReduceProgram (sumProg : Program<DeviceReduceImpl.IDeviceReduceFactory<'A>>) (x: DeviceMemory<'A>) =
        sumProg.Entry.Create(blob, x.Length)
               .Reduce(None, x.Ptr, x.Length)
    let reduceProg (x: DeviceMemory<'T>) = runReduceProgram prog x
    reduceProg
let sumReduce: DeviceMemory<float32> -> float32 = makeReduce <@ fun (a:float32) b -> a + b @>
#time
//3.5s/10k
for i=1 to 10000 do
    sumReduce dmat.dArray |> ignore
#time
I haven't tried comparing this with CUDA C++, but for simple things I think it should be on par with cuBLAS. I thought the optimization flag might have been off, but then found it was on by default. Are there any optimization tips I am missing here?
I think there are some issues in your testing code:
In your mapping module, you should preload the GPUModule. A GPUModule is JIT-compiled the first time it is launched, so your timing measurement actually includes the GPU code compilation time;
In both the Alea code and the cuBLAS code, you should sync the worker (i.e., sync the CUDA context). CUDA programming is asynchronous: when you launch a kernel, the call returns immediately without waiting for the kernel to complete. If you don't sync the worker, you are actually measuring the kernel launching time, not the kernel execution time. Alea GPU's launching time will be slower than native C code, since it does some marshaling of the kernel arguments. There are some other issues related to kernel launching time, which I will show in the example code below;
Your reduce test actually loads the reduce module each time! That means each reduction you time also includes the GPU compilation time. It is suggested that you make the GPU module or Program instance long-lived, since it represents the compiled GPU code.
So, I did a test following your usage. Here I first list the full test code:
#r @"packages\Alea.CUDA.2.1.2.3274\lib\net40\Alea.CUDA.dll"
#r @"packages\Alea.CUDA.IL.2.1.2.3274\lib\net40\Alea.CUDA.IL.dll"
#r @"packages\Alea.CUDA.Unbound.2.1.2.3274\lib\net40\Alea.CUDA.Unbound.dll"
#r "System.Configuration"
open System.IO
Alea.CUDA.Settings.Instance.Resource.AssemblyPath <- Path.Combine(@"packages\Alea.CUDA.2.1.2.3274", "private")
Alea.CUDA.Settings.Instance.Resource.Path <- Path.GetTempPath()
open Alea.CUDA
open Alea.CUDA.Utilities
open Alea.CUDA.CULib
open Alea.CUDA.Unbound
open Microsoft.FSharp.Quotations
type MapModule(target, op:Expr<float32 -> float32>) =
    inherit GPUModule(target)

    [<Kernel;ReflectedDefinition>]
    member this.Kernel (C:deviceptr<float32>) (A:deviceptr<float32>) (B:deviceptr<float32>) (n:int) =
        let start = blockIdx.x * blockDim.x + threadIdx.x
        let stride = gridDim.x * blockDim.x
        let mutable i = start
        while i < n do
            C.[i] <- __eval(op) A.[i] + __eval(op) B.[i]
            i <- i + stride

    member this.Apply(C:deviceptr<float32>, A:deviceptr<float32>, B:deviceptr<float32>, n:int) =
        let lp = LaunchParam(64, 256)
        this.GPULaunch <@ this.Kernel @> lp C A B n
let inline mapTemplate (op:Expr<'T -> 'T>) = cuda {
    let! kernel =
        <@ fun (C:deviceptr<'T>) (A:deviceptr<'T>) (B:deviceptr<'T>) (n:int) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                C.[i] <- (%op) A.[i] + (%op) B.[i]
                i <- i + stride @>
        |> Compiler.DefineKernel

    return Entry(fun program ->
        let worker = program.Worker
        let kernel = program.Apply kernel
        let lp = LaunchParam(64, 256)
        let run C A B n =
            kernel.Launch lp C A B n
        run ) }
let test1 (worker:Worker) m n sync iters =
    let n = m * n
    use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (no pre-load module)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test2 (worker:Worker) m n sync iters =
    let n = m * n
    use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
    // we pre-load the module, this will JIT compile the GPU code
    m.GPUForceLoad()
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (pre-loaded module)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test3 (worker:Worker) m n sync iters =
    let n = m * n
    use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
    // we pre-load the module, this will JIT compile the GPU code
    m.GPUForceLoad()
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    // since the worker is running in a background thread
    // each cuda api will switch to that thread
    // use eval() to avoid the many thread switching
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (pre-loaded module + worker.eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test4 (worker:Worker) m n sync iters =
    use program = worker.LoadProgram(mapTemplate <@ fun x -> x * 2.0f @>)
    let n = m * n
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        program.Run C.Ptr A.Ptr B.Ptr n
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (template usage)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test5 (worker:Worker) m n sync iters =
    use program = worker.LoadProgram(mapTemplate <@ fun x -> x * 2.0f @>)
    let n = m * n
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            program.Run C.Ptr A.Ptr B.Ptr n
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (template usage + worker.Eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test6 (worker:Worker) m n sync iters =
    use cublas = new CUBLAS(worker)
    let rng = System.Random(42)
    use dmat1 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmat2 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmatr = worker.Malloc<float32>(m * n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        cublas.Sgeam(cublasOperation_t.CUBLAS_OP_N, cublasOperation_t.CUBLAS_OP_N, m, n, 2.0f, dmat1.Ptr, m, 2.0f, dmat2.Ptr, m, dmatr.Ptr, m)
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (cublas)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test7 (worker:Worker) m n sync iters =
    use cublas = new CUBLAS(worker)
    let rng = System.Random(42)
    use dmat1 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmat2 = worker.Malloc(Array.init (m * n) (fun _ -> rng.NextDouble() |> float32))
    use dmatr = worker.Malloc<float32>(m * n)
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to iters do
            cublas.Sgeam(cublasOperation_t.CUBLAS_OP_N, cublasOperation_t.CUBLAS_OP_N, m, n, 2.0f, dmat1.Ptr, m, 2.0f, dmat2.Ptr, m, dmatr.Ptr, m)
        if sync then worker.Synchronize()
        timer.Stop()
        printfn "%f ms / %d %s (cublas + worker.eval)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")
let test worker m n sync iters =
    test6 worker m n sync iters
    test7 worker m n sync iters
    test1 worker m n sync iters
    test2 worker m n sync iters
    test3 worker m n sync iters
    test4 worker m n sync iters
    test5 worker m n sync iters
let testReduce1 (worker:Worker) n iters =
    let rng = System.Random(42)
    use input = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use reduceModule = new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker), <@ (+) @>)
    // JIT compile and load GPU code for this module
    reduceModule.GPUForceLoad()
    // create a reducer which will allocate temp memory for maxNum=n
    let reduce = reduceModule.Create(n)
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to 10000 do
        reduce.Reduce(input.Ptr, n) |> ignore
    timer.Stop()
    printfn "%f ms / %d (pre-load gpu code)" timer.Elapsed.TotalMilliseconds iters
let testReduce2 (worker:Worker) n iters =
    let rng = System.Random(42)
    use input = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use reduceModule = new DeviceReduceModule<float32>(GPUModuleTarget.Worker(worker), <@ (+) @>)
    // JIT compile and load GPU code for this module
    reduceModule.GPUForceLoad()
    // create a reducer which will allocate temp memory for maxNum=n
    let reduce = reduceModule.Create(n)
    worker.Eval <| fun _ ->
        let timer = System.Diagnostics.Stopwatch.StartNew()
        for i = 1 to 10000 do
            reduce.Reduce(input.Ptr, n) |> ignore
        timer.Stop()
        printfn "%f ms / %d (pre-load gpu code and avoid thread switching)" timer.Elapsed.TotalMilliseconds iters
let testReduce worker n iters =
    testReduce1 worker n iters
    testReduce2 worker n iters
let workerDefault = Worker.Default
let workerNoThread = Worker.CreateOnCurrentThread(Device.Default)
In Alea GPU, a worker represents a CUDA context, and currently we use the pattern that each GPU has one dedicated thread, with the CUDA context attached to that thread. We call this a "worker with dedicated thread". This also means that every CUDA API call, such as a kernel launch, has to switch to the worker thread. If you are doing a lot of kernel launches, it is suggested to use the Worker.Eval function to execute your code inside the worker thread and avoid the thread switching. There is also an experimental feature for creating a worker on the current thread, which avoids the thread switching entirely, but we are still optimizing that usage. For more detail, please refer to the documentation.
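As a minimal sketch of that pattern (reusing the MapModule from the listing above; launchMany is just an illustrative helper, not one of the tests), wrapping the whole launch loop in Worker.Eval pays the thread switch once for the block instead of once per launch:

// everything inside the lambda runs on the worker's dedicated thread,
// so we switch threads once instead of once per kernel launch
let launchMany (worker:Worker) (map:MapModule) (C:deviceptr<float32>) (A:deviceptr<float32>) (B:deviceptr<float32>) (n:int) (iters:int) =
    worker.Eval <| fun _ ->
        for i = 1 to iters do
            map.Apply(C, A, B, n)
        worker.Synchronize()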
Now we first use the default worker to run a test without syncing the worker (which means we are comparing the kernel launching time only). The default worker is a worker with a dedicated thread, so you can see it performs better when we use Worker.Eval. But overall, kernel launching from .NET is slower than native C kernel launching:
> test workerDefault 10000 10000 false 100;;
4.487300 ms / 100 nosync (cublas)
0.560600 ms / 100 nosync (cublas + worker.eval)
304.427900 ms / 100 nosync (no pre-load module)
18.517000 ms / 100 nosync (pre-loaded module)
12.579100 ms / 100 nosync (pre-loaded module + worker.eval)
27.023800 ms / 100 nosync (template usage)
16.007500 ms / 100 nosync (template usage + worker.Eval)
val it : unit = ()
> test workerDefault 10000 10000 false 100;;
3.288600 ms / 100 nosync (cublas)
0.647300 ms / 100 nosync (cublas + worker.eval)
29.129100 ms / 100 nosync (no pre-load module)
18.874700 ms / 100 nosync (pre-loaded module)
12.285000 ms / 100 nosync (pre-loaded module + worker.eval)
20.452300 ms / 100 nosync (template usage)
14.903500 ms / 100 nosync (template usage + worker.Eval)
val it : unit = ()
Also, you might have noticed that I ran this test twice: the first time, the test without the pre-loaded module took 304 ms, but the second time it took only 29 ms. The reason is that we use LLVM P/Invoke to compile the kernel, and those P/Invoke functions are lazy, so they do some initialization the first time you use them; after that, it becomes faster.
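If for some reason the module cannot be pre-loaded, one way to keep that first-use cost out of the measurement is an untimed warm-up launch before starting the stopwatch. A minimal sketch, written as a variant of test1 above (testWarmUp is not part of the measured results):

let testWarmUp (worker:Worker) m n sync iters =
    let n = m * n
    use m = new MapModule(GPUModuleTarget.Worker(worker), <@ fun x -> x * 2.0f @>)
    let rng = System.Random(42)
    use A = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use B = worker.Malloc(Array.init n (fun _ -> rng.NextDouble() |> float32))
    use C = worker.Malloc<float32>(n)
    // one untimed launch: this triggers the JIT compilation (and, on the very
    // first use in the process, the lazy LLVM P/Invoke initialization), so
    // none of that shows up inside the timed loop below
    m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    worker.Synchronize()
    let timer = System.Diagnostics.Stopwatch.StartNew()
    for i = 1 to iters do
        m.Apply(C.Ptr, A.Ptr, B.Ptr, n)
    if sync then worker.Synchronize()
    timer.Stop()
    printfn "%f ms / %d %s (warm-up launch before timing)" timer.Elapsed.TotalMilliseconds iters (if sync then "sync" else "nosync")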
Now we sync the worker, which measures the real kernel execution time, and the numbers are similar. The kernel I created here is very simple, but it applies the op to both matrix A and B:
> test workerDefault 10000 10000 true 100;;
843.695000 ms / 100 sync (cublas)
841.452400 ms / 100 sync (cublas + worker.eval)
919.244900 ms / 100 sync (no pre-load module)
912.348000 ms / 100 sync (pre-loaded module)
908.909000 ms / 100 sync (pre-loaded module + worker.eval)
914.834100 ms / 100 sync (template usage)
914.170100 ms / 100 sync (template usage + worker.Eval)
Now, if we run them on the thread-less worker, they are a little faster, since there is no thread switching:
> test workerNoThread 10000 10000 true 100;;
842.132100 ms / 100 sync (cublas)
841.627200 ms / 100 sync (cublas + worker.eval)
918.007800 ms / 100 sync (no pre-load module)
908.575900 ms / 100 sync (pre-loaded module)
908.770100 ms / 100 sync (pre-loaded module + worker.eval)
913.405300 ms / 100 sync (template usage)
913.942600 ms / 100 sync (template usage + worker.Eval)
Now here is the test of the reduce:
> testReduce workerDefault 10000000 100;;
7691.335300 ms / 100 (pre-load gpu code)
6448.782500 ms / 100 (pre-load gpu code and avoid thread switching)
val it : unit = ()
> testReduce workerNoThread 10000000 100;;
6467.105300 ms / 100 (pre-load gpu code)
6426.296900 ms / 100 (pre-load gpu code and avoid thread switching)
val it : unit = ()
Please note that in this reduction test there is one memory gather (memcpyDtoH) per reduction to bring the result from device to host, and that memory copy automatically syncs the worker, because the value is meaningless if the kernel hasn't finished. So if you want to compare the performance with C code, you should also copy the result scalar from device to host there. Although it is just one CUDA API call, since you do it for many iterations (100 in this example), it will accumulate some time.
Hope this answers your question.