I am just getting started with Theano and Deep Learning. I was experimenting with an example from the Theano tutorial (http://deeplearning.net/software/theano/tutorial/using_gpu.html#returning-a-handle-to-device-allocated-data). The example code is shown here:
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
iters = 1000
rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print(f.maker.fgraph.toposort())
t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
print("Result is %s" % (r,))
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')
I am trying to understand the expression defining 'vlen',
vlen = 10 * 30 * 768 # 10 x #cores x # threads per core
I can't find anything in the text that explains where the number of GPU cores used in this example comes from or why 30 was selected. Nor can I find why the value of 768 threads was used. My GPU (GeForce 840M) has 384 cores. Can I assume that if I substitute 384 for the value of 30, I will be using all 384 cores? Also, should the value of 768 threads remain fixed?
I believe the logic is as follows. Looking at the referenced page, we see that there is mention of a GTX 275 GPU. So the GPU being used for that tutorial may have been a very old CUDA GPU from the cc1.x generation (no longer supported by CUDA 7.0 and 7.5). In the comment, the developer seems to be using the word "core" to refer to a GPU SM (multiprocessor).
There were a number of GPUs in that family that had 30 SMs (a cc1.x SM was a very different animal from a cc2.x+ SM), including the GTX 275 (240 CUDA cores = 30 SMs * 8 cores/SM in the cc1.x generation). So the number 30 is derived from the number of SMs in the GPU being used at the time.
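For reference, the SM count falls out of the core count in that generation, since every cc1.x SM had exactly 8 CUDA cores (a minimal sketch; 240 is the published core count for the GTX 275):

```python
# cc1.x GPUs (e.g. GTX 275) had a fixed 8 CUDA cores per SM
cuda_cores = 240           # total CUDA cores on a GTX 275
cores_per_sm = 8           # fixed for the cc1.x generation
num_sms = cuda_cores // cores_per_sm
print(num_sms)             # 30 -- the middle factor in vlen
```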
Furthermore, if you review old documentation for CUDA versions that supported such GPUs, you will find that cc1.0 and cc1.1 GPUs supported a max of 768 threads per multiprocessor (SM). So I believe this is where the 768 number comes from.
Finally, a good CUDA code will oversubscribe the GPU (total number of threads is more than what the GPU can instantaneously handle). So I believe the factor of 10 is just to ensure "oversubscription".
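Putting the three factors together, the vlen expression can be sketched as follows (the variable names are my own labels for the reasoning above, not anything taken from the tutorial):

```python
oversubscription = 10    # keep ~10x more threads in flight than the GPU can run at once
sms = 30                 # SMs ("cores" in the tutorial comment) on a GTX 275
threads_per_sm = 768     # max resident threads per SM on cc1.0/cc1.1 hardware
vlen = oversubscription * sms * threads_per_sm
print(vlen)              # 230400, matching 10 * 30 * 768 in the tutorial
```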
There is no magic to a particular number -- it is just the length of an array (vlen). The length of this array, after it flows through the Theano framework, will ultimately determine the number of threads in the CUDA kernel launch. This code isn't really a benchmark or other performance indicator. Its stated purpose is just to demonstrate that the GPU is being used.
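To illustrate how an array length translates into a launch configuration, here is a sketch of the usual ceiling-division grid sizing an elementwise kernel would use (256 threads per block is just a typical choice on my part, not something the tutorial specifies):

```python
vlen = 10 * 30 * 768                 # array length from the tutorial
threads_per_block = 256              # a common, arbitrary block size
# ceiling division: enough blocks so every element gets a thread
num_blocks = (vlen + threads_per_block - 1) // threads_per_block
total_threads = num_blocks * threads_per_block
print(num_blocks, total_threads)     # 900 blocks, 230400 threads
```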
So I wouldn't read too much into that number. It was a casual choice by the developer that followed a certain amount of logic pertaining to the GPU at hand.