CUDA runtime gpu initialization with theano

I am trying to parallelize my NN across two GPUs following https://github.com/uoguelph-mlrg/theano_multi_gpu. I have all the dependencies, but the cuda runtime initialization fails with the following message.

ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device 0 failed:
cublasCreate() returned this error 'the CUDA Runtime initialization failed'
Error when trying to find the memory information on the GPU: invalid device ordinal
Error allocating 24 bytes of device memory (invalid device ordinal). Driver report 0 bytes free and 0 bytes total
ERROR (theano.sandbox.cuda): ERROR: Not using GPU. Initialisation of device gpu failed:
CudaNdarray_ZEROS: allocation failed.
Process Process-1:
Traceback (most recent call last):
  File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/share/Python-2.7.9/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/u/bsankara/nt/Git-nt/nt/train_attention.py", line 171, in launch_train
    clip_c=1.)
  File "/u/bsankara/nt/Git-nt/nt/nt.py", line 1616, in train
    import theano.sandbox.cuda
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/__init__.py", line 98, in <module>
    theano.sandbox.cuda.tests.test_driver.test_nvidia_driver1()
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/tests/test_driver.py", line 30, in test_nvidia_driver1
    A = cuda.shared_constructor(a)
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/var.py", line 181, in float32_shared_constructor
    enable_cuda=False)
  File "/opt/share/Python-2.7.9/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py", line 389, in use
    cuda_ndarray.cuda_ndarray.CudaNdarray.zeros((2, 3))
RuntimeError: ('CudaNdarray_ZEROS: allocation failed.', 'You asked to force this device and it failed. No fallback to the cpu or other gpu device.')

The relevant part of the code snippet is here:

from multiprocessing import Queue
import zmq
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

def train(private_args, process_env, <some other args>)
    if process_env is not None:
       os.environ = process_env

    ####
    # pycuda and zmq environment

    drv.init()
    dev = drv.Device(private_args['ind_gpu'])
    ctx = dev.make_context()
    sock = zmq.Context().socket(zmq.PAIR)

    if private_args['flag_client']:
        sock.connect('tcp://localhost:5000')
    else:
        sock.bind('tcp://*:5000')

    ####
    # import theano stuffs
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(private_args['gpu'])

    import theano
    import theano.tensor as tensor
    from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
    import theano.misc.pycuda_init
    import theano.misc.pycuda_utils
...

The error is triggered when it imports theano.sandbox.cuda. And this is where, I launch the training function as two processes.

def launch_train(curr_args, process_env, curr_queue, oth_queue):
    trainerr, validerr, testerr = train(private_args=curr_args,
                                        process_env=process_env,
                                         ...)

process1_env = os.environ.copy()
process1_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu0,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU1"
process2_env = os.environ.copy()
process2_env['THEANO_FLAGS'] = "cuda.root=/opt/share/cuda-7.0,device=gpu1,floatX=float32,on_unused_input=ignore,optimizer=fast_run,exception_verbosity=high,compiledir=/u/bsankara/.theano/NT_multi_GPU2"

p = Process(target=launch_train,
                args=(p_args, process1_env, queue_p, queue_q))
q = Process(target=launch_train,
                args=(q_args, process2_env, queue_q, queue_p))

p.start()
q.start()
p.join()
q.join()

The import statement however seem to work if I try to initialize the gpu interactively in Python. I executed the first 20 lines of the train() and it worked fine there and also correctly assigned me to gpu0 as I requested.

Solution

After digging around and running pdb, the original poster found the issue.

Basically theano and pycuda were both competing to initialize the gpu, causing the problem. The solution is to first 'import theano', which would get a gpu and then attach to the specific context in pycuda. So, the import sections within train function would look like this:

def train(private_args, process_env, <some other args>)
    if process_env is not None:
       os.environ = process_env

    ####
    # import theano related
    # We need global imports and so we make them as such
    theano = __import__('theano')
    _t_tensor = __import__('theano', globals(), locals(), ['tensor'], -1)
    tensor = _t_tensor.tensor

    import theano.sandbox.cuda
    import theano.misc.pycuda_utils

    ####
    # pycuda and zmq environment
    import zmq
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    drv.init()
    # Attach the existing context (already initialized by theano import statement)
    ctx = drv.Context.attach()
    sock = zmq.Context().socket(zmq.PAIR)

    if private_args['flag_client']:
        sock.connect('tcp://localhost:5000')
    else:
        sock.bind('tcp://*:5000')

[This answer was added as a community wiki entry from an edit made by the OP in a attempt to get this question off the unaswered list].