I ran SimpleSpeedTest.py from the PyCuda examples, producing the following output:
Using nbr_values == 8192
Calculating 100000 iterations
SourceModule time and first three results:
0.058294s, [ 0.005477 0.005477 0.005477]
Elementwise time and first three results:
0.102527s, [ 0.005477 0.005477 0.005477]
Elementwise Python looping time and first three results:
2.398071s, [ 0.005477 0.005477 0.005477]
GPUArray time and first three results:
8.207257s, [ 0.005477 0.005477 0.005477]
CPU time measured using :
0.000002s, [ 0.005477 0.005477 0.005477]
The first four time measurements are reasonable, the last one (0.000002s) however is way off. The CPU result should be the slowest one but it is orders of magnitude faster than the fastest GPU method. So obviously the measured time must be wrong. This is strange since the same timing method seems to work fine for the first four results.
So I took some code from SimpleSpeedTest.py and made a small test file [2], which produced:
time measured using option 1:
0.000002s
time measured using option 2:
5.989620s
Option 1 measures the duration using pycuda.driver.Event.record()
(as in SimpleSpeedTest.py), option 2 uses time.clock()
. Again, option 1 is off while option 2 gives a reasonable result (the time it takes to run the test file is around 6s).
Does anyone have an idea as to why this is happening?
Since using option 1 is endorsed in SimpleSpeedTest.py, could it be my setup that is causing the problem? I am running a GTX 470, Display Driver 301.42, CUDA 4.2, Python 2.7 64, PyCuda 2012.1, X5650 Xeon
[2] Test file:
import numpy
import time
import pycuda.driver as drv
import pycuda.autoinit
n_iter = 100000
nbr_values = 8192 # = 64 * 128 (values as used in SimpleSpeedTest.py)
start = drv.Event() # option 1 uses pycuda.driver.Event
end = drv.Event()
a = numpy.ones(nbr_values).astype(numpy.float32) # test data
start.record() # start option 1 (inserting recording points into GPU stream)
tic = time.clock() # start option 2 (using CPU time)
for i in range(n_iter):
a = numpy.sin(a) # do some work
end.record() # end option 1
toc = time.clock() # end option 2
end.synchronize()
events_secs = start.time_till(end)*1e-3
time_secs = toc - tic
print "time measured using option 1:"
print "%fs " % events_secs
print "time measured using option 2:"
print "%fs " % time_secs
I contacted Andreas Klöckner and he suggested to synchronize on the start event, too.
...
start.record()
start.synchronize()
...
And this seems to solve the issue!
time measured using option 1:
5.944461s
time measured using option 2:
5.944314s
Apparently CUDA's behaviour changed in the last two years. I updated SimpleSpeedTest.py.