
Theano: cublasSgemm failed (14) an internal operation failed


Sometimes, after a while of running fine, I get such an error with Theano / CUDA:

RuntimeError: cublasSgemm failed (14) an internal operation failed
 unit=0 N=0, c.dims=[512 2048], a.dim=[512 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(512, 493), (493, 2048)]
Inputs strides: [(493, 1), (2048, 1)]
Inputs values: ['not shown', 'not shown']

My code usually runs fine for a while (I'm training a neural network, and most runs complete; even when this error occurred, the run had already processed >2000 mini-batches), so I wonder what causes it. Maybe some hardware fault?

This is with CUDA 6.0 and a very recent Theano (yesterday from Git), Ubuntu 12.04, GTX 580.

I also got the error with CUDA 6.5 on a K20:

RuntimeError: cublasSgemm failed (14) an internal operation failed
 unit=0 N=0, c.dims=[2899 2000], a.dim=[2899 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2899, 493), (493, 2000)]
Inputs strides: [(493, 1), (2000, 1)]
Inputs values: ['not shown', 'not shown']

(In the past, I sometimes got a different error instead. I'm not sure whether that is related.)

Via Markus, who got the same error:

RuntimeError: cublasSgemm failed (14) an internal operation failed
 unit=0 N=0, c.dims=[2 100], a.dim=[2 9919], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuFlatten{2}.0, weight_hidden_)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2, 9919), (9919, 100)]
Inputs strides: [(9919, 1), (100, 1)]
Inputs values: ['not shown', 'not shown']

With CUDA 6.5, Windows 8.1, Python 2.7, GTX 970M.

The error only occurs in my own network; the LeNet example from Theano runs fine. The network also compiles and runs fine on the CPU (and on the GPU for some colleagues using Linux). Does anyone have an idea what the problem could be?


Solution

  • Just for reference in case anyone stumbles upon this:

    This doesn't occur anymore for me. I'm not exactly sure what fixed it, but I think the main difference is that I now avoid any multithreading and forks (without exec). Forks caused many similar problems, e.g. Theano CUDA error: an illegal memory access was encountered (StackOverflow), and Theano CUDA error: an illegal memory access was encountered (Google Groups discussion). Especially the Google Groups discussion is very helpful.

    Theano functions are not thread-safe. That by itself is not a problem for me, because I only call them from a single thread. However, I suspect that other threads can still cause these problems. Maybe it is related to Python's GC, which frees some CudaNdarray in another thread while the theano.function is running.

    I looked a bit at the relevant Theano code, and I'm not sure it covers all such cases.

    Note that you might not even be aware that you have background threads. Some Python stdlib code spawns them; multiprocessing.Queue, for example, does.
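    A quick way to see such a hidden thread (a small sketch of my own, not from any particular library's docs): list the live threads before and after the first put() on a multiprocessing.Queue, which silently starts a feeder thread.

    ```python
    import multiprocessing
    import threading
    import time

    # multiprocessing.Queue lazily starts a hidden feeder thread on the
    # first put(); comparing live thread names makes it visible.
    q = multiprocessing.Queue()
    before = {t.name for t in threading.enumerate()}
    q.put("item")
    time.sleep(0.5)  # give the feeder thread a moment to start
    after = {t.name for t in threading.enumerate()}
    print(sorted(after - before))
    ```

    The extra thread exists purely for pushing queued items into the pipe, but it is still a second thread living in your process.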

    I cannot avoid having multiple threads, so until this is fixed in Theano, I create a new subprocess with a single thread where I do all the Theano work. This also has several advantages: a clearer separation of the code, faster execution in some cases because everything really runs in parallel, and the ability to use multiple GPUs.

    Note that just using the multiprocessing module did not work well for me, because a few libs (Numpy and others, and maybe Theano itself) can behave badly in a forked process (depending on the versions, the OS, and race conditions). Thus, I needed a real subprocess (fork + exec, not just fork).
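    As an aside (my addition; this requires Python 3.4+ and was not part of the original setup): the stdlib can now give you fork+exec semantics for worker processes directly via the "spawn" start method, which sidesteps the bare-fork problems described above.

    ```python
    import multiprocessing as mp

    def square(x):
        # Stand-in for the real GPU work. In a "spawn" child the
        # interpreter was started fresh (fork + exec), so it inherits
        # no threads, locks, or CUDA state from the parent.
        return x * x

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")  # fork+exec instead of a bare fork
        with ctx.Pool(processes=1) as pool:
            squares = pool.map(square, [1, 2, 3])
        print(squares)  # [1, 4, 9]
    ```

    With "spawn", the child re-imports your main module instead of inheriting the parent's memory image, which is exactly what makes it safe around libraries holding CUDA contexts.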

    My code is here, in case anyone is interested in this.

    There is ExecingProcess, which is modeled after multiprocessing.Process but does a fork+exec. (By the way, on Windows, the multiprocessing module does this anyway, because there is no fork on Windows.) And there is AsyncTask, which adds a duplex pipe on top of this and works with both ExecingProcess and the standard multiprocessing.Process.
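    The core idea behind ExecingProcess can be sketched like this (a simplified, hypothetical version, not the actual code: the child program, the run_in_fresh_interpreter name, and the plain-pickle-over-stdio protocol are my own stand-ins):

    ```python
    import pickle
    import subprocess
    import sys

    # Child program: read one pickled task from stdin, write the pickled
    # result to stdout. The real worker would build and call the Theano
    # functions here instead.
    CHILD_CODE = r"""
    import pickle, sys
    task = pickle.load(sys.stdin.buffer)
    result = task["x"] * task["y"]  # stand-in for the real Theano work
    pickle.dump(result, sys.stdout.buffer)
    sys.stdout.buffer.flush()
    """

    def run_in_fresh_interpreter(task):
        # sys.executable + "-c" gives fork + exec: the child is a
        # brand-new interpreter with no inherited CUDA context and no
        # inherited background threads.
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        out, _ = proc.communicate(pickle.dumps(task))
        return pickle.loads(out)

    answer = run_in_fresh_interpreter({"x": 6, "y": 7})
    print(answer)
    ```

    A long-lived worker would keep the pipes open and exchange many messages instead of one round trip per process, which is what the duplex-pipe AsyncTask layer provides.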

    See also: Theano Wiki: Using multiple GPUs