I'm currently working on a dot product with pinned memory using PyCUDA, and I have a problem with big arrays.
I'm working with:
The code is:
#!/usr/bin/env python
import numpy as np
import argparse
import math
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from time import time

dot_mod = SourceModule("""
__global__ void full_dot( double* v1, double* v2, double* out, int N ) {
    __shared__ double cache[ 1024 ];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[ threadIdx.x ] = 0.f;
    while( i < N ) {
        cache[ threadIdx.x ] += v1[ i ] * v2[ i ];
        i += gridDim.x * blockDim.x;
    }
    __syncthreads(); // required because later on the current thread is accessing
                     // data written by another thread
    i = 1024 / 2;
    while( i > 0 ) {
        if( threadIdx.x < i ) cache[ threadIdx.x ] += cache[ threadIdx.x + i ];
        __syncthreads();
        i /= 2; // not sure bitwise operations are actually faster
    }
#ifndef NO_SYNC // serialized access to shared data;
    if( threadIdx.x == 0 ) atomicAdd( out, cache[ 0 ] );
#else // no sync, what most likely happens is:
    // 1) all threads read 0
    // 2) all threads write concurrently 16 (local block dot product)
    if( threadIdx.x == 0 ) *out += cache[ 0 ];
#endif
}
""")

def main(args):
    dot = dot_mod.get_function("full_dot")
    N = args.number
    BLOCK_SIZE = 1024
    BLOCKS = int(math.ceil(N/BLOCK_SIZE))
    THREADS_PER_BLOCK = BLOCK_SIZE

    # Time use of pinned host memory:
    x = drv.aligned_empty((N), dtype=np.float64, order='C')
    x = drv.register_host_memory(x, flags=drv.mem_host_register_flags.DEVICEMAP)
    x_gpu_ptr = np.intp(x.base.get_device_pointer())

    # Time use of pinned host memory:
    y = drv.aligned_empty((N), dtype=np.float64, order='C')
    y = drv.register_host_memory(y, flags=drv.mem_host_register_flags.DEVICEMAP)
    y_gpu_ptr = np.intp(y.base.get_device_pointer())

    # Time use of pinned host memory:
    z = drv.aligned_empty((1), dtype=np.float64, order='C')
    z = drv.register_host_memory(z, flags=drv.mem_host_register_flags.DEVICEMAP)
    z_gpu_ptr = np.intp(z.base.get_device_pointer())

    z[:] = np.zeros(1)
    x[:] = np.zeros(N)
    y[:] = np.zeros(N)
    x[:] = np.random.rand(N)
    y[:] = x[:]
    x_orig = x.copy()
    y_orig = y.copy()

    start = time()
    dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N), block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS, 1))
    times = time() - start
    print "Average kernel GPU dot product execution time with pinned memory: %3.7f" % np.mean(times)

    start = time()
    ydot = np.dot(x_orig, y_orig)
    times = time() - start
    print "Average numpy dot product execution time: %3.7f" % np.mean(times)

    print N, ydot, z[0]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=' ')
    parser.add_argument('-n', dest='number', type=long, help="Number of samples ", required=True)
    args = parser.parse_args()
    main(args)
I wrote this code and it works well when the sample array size is approximately less than 1024*12, but for big arrays like 1024*1024 it gives a wrong result.
➜ ./test_dot_pinned.py -n 16384
Average kernel GPU dot product execution time with pinned memory: 0.0001669
Average numpy dot product execution time: 0.0000119
16384 5468.09590706 5468.09590706
SIZE np.dot() GPU-dot-pinned
➜ ./test_dot_pinned.py -n 1048576
Average kernel GPU dot product execution time with pinned memory: 0.0002351
Average numpy dot product execution time: 0.0010922
1048576 349324.532564 258321.148593
SIZE np.dot() GPU-dot-pinned
Thanks to everyone; I hope someone can help me.
PyCUDA doesn't enforce any synchronization after a kernel launch. Normally, if you do a device-to-host copy of data after a kernel launch, the operation forces a synchronization, i.e. it forces the kernel to complete.
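For illustration, here is a minimal sketch of that pattern, assuming the result lived in a plain device allocation instead of mapped host memory (z_host and z_dev are hypothetical names, not taken from the script above):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

z_host = np.zeros(1, dtype=np.float64)
z_dev = drv.mem_alloc(z_host.nbytes)   # ordinary device memory, not mapped host memory
drv.memcpy_htod(z_dev, z_host)         # initialize the accumulator to zero on the device
# ... launch full_dot here with z_dev as the output argument ...
drv.memcpy_dtoh(z_host, z_dev)         # blocking copy: the kernel must complete before this returns
print(z_host[0])                       # safe to read, the kernel has finished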
But you have no such synchronization in your code. Since you are using pinned (mapped) memory, the host reads z[0] directly with nothing forcing the kernel to complete first; as the kernel execution time grows with the larger work size, by the time you print z[0] you are only getting a partial result, because the kernel has not finished at that point.
A side effect of this is that your kernel time measurement is not accurate either.
You can fix both of these by forcing the kernel to complete before you finish your time measurement:
dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N), block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS,1))
#add the next line of code:
drv.Context.synchronize()
times = time()-start
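With the synchronize() call in place, both the timing and the read of z[0] happen only after the kernel has actually finished. For reference, a sketch of how the timed section of your script might then look (the np.isclose check at the end is just an added sanity check, not part of your original code):

start = time()
dot(x_gpu_ptr, y_gpu_ptr, z_gpu_ptr, np.uint32(N), block=(THREADS_PER_BLOCK, 1, 1), grid=(BLOCKS, 1))
drv.Context.synchronize()   # wait for the kernel to finish before stopping the clock and reading z
gpu_time = time() - start

ydot = np.dot(x_orig, y_orig)
print "GPU: %.6f  numpy: %.6f  agree: %s" % (z[0], ydot, np.isclose(z[0], ydot))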