I want to compare the performance of Theano and CNTK on a very simple task: matrix-vector product on the GPU. I am using Theano 0.9.0 and CNTK 2.0.
I want to measure only the time consumed by the computation on the device, excluding the time spent transferring data between host and device.
The results I got looked like this:

[figure: timings, Theano vs. CNTK; N is the number of repetitions, and D, the size of the matrix, was set to 10000]
Question 1:
It seems that the time spent on some preparation (compiling the computational graph?) is included in the first execution of the mat-vec product in the CNTK case. Is there any way to separate the preparation from the execution in CNTK, as there is in the Theano case?
Question 2:
I am used to Theano but totally new to CNTK, so I am not quite sure whether the CNTK code is equivalent to the Theano code. In particular, I am not sure whether the operation in the for loop of the CNTK code really stays enclosed on the device, since prod.eval() returns a numpy.ndarray. Am I missing something?
Code used to measure the timings:
import numpy as np
import time

# theano
def test_matVecDot_theano(D, N):
    import theano
    import theano.tensor as T
    A_cpu = np.random.normal(size=[D, D]).astype(np.float32)
    x_cpu = np.random.normal(size=[D]).astype(np.float32)
    A_gpu = theano.shared(A_cpu)  # operands are copied to the device once, up front
    x_gpu = theano.shared(x_cpu)
    b_gpu = theano.shared(x_cpu)  # device-side buffer that will hold the result
    b_gpu_new = T.dot(A_gpu, x_gpu)
    # compiling here keeps graph compilation out of the timed loop
    fnc = theano.function(inputs=[], outputs=None,
                          updates=[(b_gpu, b_gpu_new)],
                          allow_input_downcast=True)
    tic = time.time()
    for i in range(N):
        fnc()  # updates b_gpu in place on the device; nothing is returned to the host
    toc = time.time()
    print("time_theano:", toc - tic)
# cntk
def test_matVecDot_CNTK(D, N):
    import cntk as C
    A_cpu = np.random.normal(size=[D, D]).astype(np.float32)
    x_cpu = np.random.normal(size=[D, 1]).astype(np.float32)
    A_c = C.Parameter(init=A_cpu, dtype=np.float32)  # Parameters live on the device
    x_c = C.Parameter(init=x_cpu, dtype=np.float32)
    b_c = C.Parameter(init=x_cpu, dtype=np.float32)
    prod = C.times(A_c, x_c)
    tic = time.time()
    for i in range(N):
        b_c.value = prod.eval()  # is this operation enclosed on the device?
    toc = time.time()
    print("time_cntk:", toc - tic)
Answer:

The short answer is no, the operation is not enclosed on the device. Here's what happens: when you call eval(), the call goes into C++, which performs the operation on the device if possible. On the way out of C++, CNTK checks the value of the as_numpy keyword argument, which is True by default. When as_numpy is True, the GPU buffer is eagerly copied to a NumPy array.

If you call prod.eval(as_numpy=False), the call to eval will not convert the GPU buffer to a NumPy array. If you assign the result to a plain old variable, you can see that you get a CNTK Value object. However, in your code you assign to the .value attribute of b_c. That assignment is handled by the setter of the value property (since this answer is getting a little too technical, I'm including this link for the sake of other readers). CNTK does the assignment on the device, although it's hard to tell: if you try to inspect b_c.value, you are calling the .value property getter, which will give you a NumPy array. So it looks like the result is a NumPy array, but that is just a consequence of using b_c.value; any other variable would let you see it is a CNTK Value object. Again, all of this applies only when you call eval(as_numpy=False).
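A minimal check of the above, with two assumptions I should flag: that a Value exposes its underlying device buffer through a .data property (an NDArrayView), and that the value setter accepts an NDArrayView so the assignment can stay on the device:

import numpy as np
import cntk as C

D = 4
A_c = C.Parameter(init=np.random.normal(size=[D, D]).astype(np.float32))
x_c = C.Parameter(init=np.random.normal(size=[D, 1]).astype(np.float32))
b_c = C.Parameter(init=np.zeros([D, 1], dtype=np.float32))
prod = C.times(A_c, x_c)

v = prod.eval(as_numpy=False)  # no eager copy: v is a CNTK Value, not an ndarray
print(type(v))                 # a CNTK Value object
b_c.value = v.data             # assumption: .data is the NDArrayView behind the Value
print(type(b_c.value))         # numpy.ndarray -- only because the getter converts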
Furthermore, CNTK uses timestamps, so the above evaluation happens only once on the GPU. All subsequent N-1 calls to eval() will just return the same Value object (though the conversion to NumPy will happen each time, unless you specify as_numpy=False).
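So as written, the loop does not actually time N matrix-vector products. One way around this is to feed x through an input variable instead of a Parameter, under the assumption that evaluations with explicit input data are recomputed on every call rather than served from the timestamp cache (and noting that depending on the 2.x build the input factory may be spelled C.input rather than C.input_variable):

import numpy as np
import cntk as C

D, N = 1000, 10
A_c = C.Parameter(init=np.random.normal(size=[D, D]).astype(np.float32))
x_var = C.input_variable((D, 1), dtype=np.float32)  # an input, not a Parameter
prod = C.times(A_c, x_var)

x_batch = np.random.normal(size=[1, D, 1]).astype(np.float32)  # leading batch axis of 1
for i in range(N):
    out = prod.eval({x_var: x_batch}, as_numpy=False)  # forward pass runs each call; result stays a Value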
Finally, I don't expect many meaningful lessons from this benchmark: both CNTK and Theano end up calling the same cuDNN implementation. The advantages of CNTK lie more in higher-level things, such as (a) it comes with a high-level library, (b) the user doesn't have to worry about the batch and sequence axes except for a few specialized operations, (c) efficient recurrent networks, (d) efficient I/O, and (e) easy distributed training.
And to answer your question about setup time: my understanding is that evaluating the function once will compile it. CNTK actually has two kinds of compilation: the first time you call eval, it compiles the forward pass only. If you later call function.grad, it throws away the eval compilation and compiles again so that it can handle both the forward and the backward pass.
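A small sketch of triggering both compilations in turn. Two assumptions here: that grad wants a scalar-valued function (hence the reduce_sum), and that it accepts an empty argument map when the function has no input variables:

import numpy as np
import cntk as C

D = 4
A_c = C.Parameter(init=np.random.normal(size=[D, D]).astype(np.float32))
x_c = C.Parameter(init=np.random.normal(size=[D, 1]).astype(np.float32))
loss = C.reduce_sum(C.times(A_c, x_c))  # reduce to a scalar output for grad

loss.eval()                   # first compilation: forward pass only
g = loss.grad({}, wrt=[A_c])  # recompiled to cover forward and backward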