I am trying to parallelize my NN across two GPUs following https://github.com/uoguelph-mlrg/theano_multi_gpu. I have all the dependencies, but the CUDA runtime initialization fails with the following message:
    ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device 0 failed:
    cublasCreate() returned this error 'the CUDA Runtime initialization failed'
    Error when trying to find the memory information on the GPU: invalid device ordinal
    Error allocating 24 bytes of device memory (invalid device ordinal). Driver report 0 bytes free and 0 bytes total
    ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
    CudaNdarray_ZEROS: allocation failed.
    Process Process-1:
    Traceback (most recent call last):
      File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
        self._target(*self._args, **self._kwargs)
      File "/u/bsankara/nt/Git-nt/nt/train_attention.py", line 171, in launch_train
        clip_c=1.)
      File "/u/bsankara/nt/Git-nt/nt/nt.py", line 1616, in train
        import theano.sandbox.cuda
      File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/__init__.py", line 98, in <module>
        theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
      File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 30, in test_nvidia_driver1
        A = cuda.shared_constructor(a)
      File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/var.py", line 181, in float32_shared_constructor
        enable_cuda=False)
      File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py", line 389, in use
        cuda_ndarray.cuda_ndarray.CudaNdarray.zeros((2, 3))
    RuntimeError: ('CudaNdarray_ZEROS: allocation failed.', 'You asked to force this device and it failed. No fallback to the cpu or other gpu device.')
The relevant part of the code snippet is here:
    import os
    from multiprocessing import Queue
    import zmq
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    def train(private_args, process_env, <some other args>):
        if process_env is not None:
            os.environ = process_env

        ####
        # pycuda and zmq environment
        drv.init()
        dev = drv.Device(private_args['ind_gpu'])
        ctx = dev.make_context()
        sock = zmq.Context().socket(zmq.PAIR)
        if private_args['flag_client']:
            sock.connect('tcp://localhost:5000')
        else:
            sock.bind('tcp://*:5000')

        ####
        # import theano stuffs
        import theano.sandbox.cuda
        theano.sandbox.cuda.use(private_args['gpu'])
        import theano
        import theano.tensor as tensor
        from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
        import theano.misc.pycuda_init
        import theano.misc.pycuda_utils
        ...
The error is triggered when train() imports theano.sandbox.cuda. This is how I launch the training function as two processes:
    from multiprocessing import Process

    def launch_train(curr_args, process_env, curr_queue, oth_queue):
        trainerr, validerr, testerr = train(private_args=curr_args,
                                            process_env=process_env,
                                            ...)

    # One environment per process: each gets its own device and compiledir
    process1_env = os.environ.copy()
    process1_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu0,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU1"
    process2_env = os.environ.copy()
    process2_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu1,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU2"

    p = Process(target=launch_train,
                args=(p_args, process1_env, queue_p, queue_q))
    q = Process(target=launch_train,
                args=(q_args, process2_env, queue_q, queue_p))
    p.start()
    q.start()
    p.join()
    q.join()
However, the import statement seems to work if I initialize the GPU interactively in Python. I executed the first 20 lines of train() there and it worked fine; it also correctly assigned me gpu0, as I requested.
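The two THEANO_FLAGS strings above differ only in `device` and `compiledir`, so they can be generated rather than written out twice. The following is a minimal sketch of that idea, not code from the original post: `build_theano_flags`, `worker_env`, and `apply_env` are hypothetical helper names, and the `/tmp/...` paths in the usage note are placeholders. It uses only the standard library and does not import theano.

```python
import os


def build_theano_flags(device, compiledir, **extra):
    """Return a THEANO_FLAGS string for one worker (hypothetical helper)."""
    flags = {"device": device, "floatX": "float32", "compiledir": compiledir}
    flags.update(extra)
    # Sort the keys so the resulting string is deterministic.
    return ",".join("%s=%s" % (key, value) for key, value in sorted(flags.items()))


def worker_env(device, compiledir, base_env=None, **extra):
    """Copy the base environment and set THEANO_FLAGS for one worker."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = build_theano_flags(device, compiledir, **extra)
    return env


def apply_env(env):
    """Update os.environ in place; meant to run in the child before theano is imported."""
    os.environ.update(env)
```

For example, `worker_env("gpu0", "/tmp/worker0")` and `worker_env("gpu1", "/tmp/worker1")` would produce the two environments. In the child process, `apply_env(process_env)` could take the place of `os.environ = process_env`: updating in place also pushes the values to the real process environment, so C libraries that call `getenv` see them, whereas rebinding `os.environ` to a plain dict is visible only to Python code that reads `os.environ` afterwards.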
After digging around and running pdb, the original poster found the issue.
Basically, theano and pycuda were both competing to initialize the GPU, which caused the problem. The solution is to first 'import theano', which acquires a GPU, and then attach to that existing context in pycuda instead of creating a new one. So the import section within the train function would look like this:
    def train(private_args, process_env, <some other args>):
        if process_env is not None:
            os.environ = process_env

        ####
        # import theano related
        # We need global imports and so we make them as such
        theano = __import__('theano')
        _t_tensor = __import__('theano', globals(), locals(), ['tensor'], -1)
        tensor = _t_tensor.tensor
        import theano.sandbox.cuda
        import theano.misc.pycuda_utils

        ####
        # pycuda and zmq environment
        import zmq
        import pycuda.driver as drv
        import pycuda.gpuarray as gpuarray
        drv.init()
        # Attach the existing context (already initialized by theano import statement)
        ctx = drv.Context.attach()
        sock = zmq.Context().socket(zmq.PAIR)
        if private_args['flag_client']:
            sock.connect('tcp://localhost:5000')
        else:
            sock.bind('tcp://*:5000')
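Since the fix depends on theano being imported before pycuda touches the device, a cheap guard placed just before `drv.init()` can turn an ordering mistake into an explicit error. This is only a sketch under that assumption: `require_theano_first` is a hypothetical helper, and it checks only that the `theano` module has been loaded, not that theano actually claimed a GPU (that still depends on the `device=` entry in THEANO_FLAGS).

```python
import sys


def require_theano_first(loaded_modules=None):
    """Raise RuntimeError if theano has not been imported yet (hypothetical helper).

    loaded_modules defaults to sys.modules; a plain dict can be passed
    instead, which makes the check easy to exercise without theano installed.
    """
    modules = sys.modules if loaded_modules is None else loaded_modules
    if "theano" not in modules:
        raise RuntimeError(
            "import theano before initializing pycuda, so that pycuda "
            "attaches to theano's context instead of creating its own"
        )
```

Calling `require_theano_first()` between the theano imports and the pycuda section would fail fast if the two sections were ever reordered.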
[This answer was added as a community wiki entry from an edit made by the OP, in an attempt to get this question off the unanswered list.]