Ubuntu 16.04 LTS; SuiteSparse 4.5.5; CUDA 8.0.61 (with performance update); NVIDIA driver 384.98
I had GPU-accelerated CHOLMOD successfully integrated into my code and working fine for several months. Then recently, out of the blue (no changes to the source code), I started seeing these errors in my output:
GPU failure in cholmod_gpu: gpu_memorysize 8.38861e+06 0 MB
CHOLMOD error: gpu memorysize failure
. file: ../GPU/cholmod_gpu.c line: 384
CHOLMOD error: CUBLAS initialization. file: ../GPU/cholmod_gpu.c line: 433
CHOLMOD error: cudaMemcpy(d_Ls). file: ../Supernodal/../GPU/t_cholmod_gpu.c line: 129
CHOLMOD error: CUDA stream. file: ../Supernodal/../GPU/t_cholmod_gpu.c line: 140
I suspected that a third-party library had updated itself unattended. But my test of CHOLMOD/Demo/cholmod_l_demo (with export CHOLMOD_USE_GPU=1) shows that CHOLMOD itself is working perfectly fine and is using the full GPU (I monitored activity with nvidia-smi). Similarly, the CUDA samples all work just fine. I've purged and reinstalled everything, including CUDA, the NVIDIA drivers, and SuiteSparse. I've tried various combinations of CUDA 8.0 and CUDA 9.0, to no avail: the CUDA samples and the CHOLMOD demos still work perfectly fine, but my CHOLMOD implementation throws the same errors.
I've traced the issue to the cudaMemGetInfo() function. For some reason, it is reporting 0 available bytes on the GPU, which triggers the first error (gpu_memorysize). The remaining errors seem to cascade from the first. This error does not happen in the CHOLMOD/Demo/cholmod_l_demo program, which suggests there is something wrong with my implementation. Yet I have changed nothing in my implementation. Does anyone have any idea why cudaMemGetInfo() would report 0 available bytes? I think the answer to this question will help guide me to the solution.
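One way to narrow this down is to call cudaMemGetInfo() directly and look at its status code as well as the byte counts. cudaSuccess is 0, so a nonzero status means the call itself failed and the reported free bytes should not be trusted. Below is a minimal sketch using Python's ctypes. The function name query_gpu_memory is hypothetical, and it assumes the CUDA runtime library can be found through the system's normal library search (ctypes.util.find_library), which may not hold if CUDA lives only under a path such as /usr/local/cuda/lib64.

```python
import ctypes
import ctypes.util


def query_gpu_memory(library_name="cudart"):
    """Call cudaMemGetInfo() from the CUDA runtime library.

    Returns (status, free_bytes, total_bytes), where status is the
    cudaError_t value (0 means cudaSuccess), or None if the library
    cannot be located or loaded. query_gpu_memory is a hypothetical
    helper name used only for this sketch.
    """
    # Assumption: the runtime is visible to the system library search.
    path = ctypes.util.find_library(library_name)
    if path is None:
        return None
    try:
        cudart = ctypes.CDLL(path)
        mem_get_info = cudart.cudaMemGetInfo
    except (OSError, AttributeError):
        return None

    # cudaError_t cudaMemGetInfo(size_t *free, size_t *total)
    mem_get_info.restype = ctypes.c_int
    mem_get_info.argtypes = [
        ctypes.POINTER(ctypes.c_size_t),
        ctypes.POINTER(ctypes.c_size_t),
    ]
    free_bytes = ctypes.c_size_t(0)
    total_bytes = ctypes.c_size_t(0)
    status = mem_get_info(ctypes.byref(free_bytes), ctypes.byref(total_bytes))
    return status, free_bytes.value, total_bytes.value


if __name__ == "__main__":
    result = query_gpu_memory()
    if result is None:
        print("CUDA runtime library not found or not loadable")
    else:
        status, free_bytes, total_bytes = result
        print("status:", status, "free:", free_bytes, "total:", total_bytes)
```

Running this once as the normal user and once with elevated privileges would show whether the result depends on who runs the program.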
I have looked up my unattended upgrade history and it appears I had some linux-headers and nvidia drivers updated around the time that I started seeing the errors. But I am not so sure the nvidia driver update is to blame since the CHOLMOD/Demo/cholmod_l_demo works perfectly fine. So I suspect it could be a linux-headers issue...
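For anyone wanting to do the same lookup, here is a rough sketch that filters the apt history for entries mentioning the driver or kernel headers. It assumes the history is in /var/log/apt/history.log with entries separated by blank lines (older entries may have been rotated into compressed files, which this sketch does not read). The function name find_upgrades and the keyword list are just illustrative choices.

```python
import os
import re


def find_upgrades(history_text, keywords=("nvidia", "linux-headers")):
    """Return the log entries that mention any of the keywords.

    Entries are taken to be blocks of text separated by blank lines.
    find_upgrades is a hypothetical helper name for this sketch.
    """
    blocks = re.split(r"\n\s*\n", history_text)
    entries = [block.strip() for block in blocks if block.strip()]
    return [
        entry
        for entry in entries
        if any(keyword in entry.lower() for keyword in keywords)
    ]


if __name__ == "__main__":
    # Assumed default location of the apt history on Ubuntu.
    log_path = "/var/log/apt/history.log"
    if os.path.isfile(log_path) and os.access(log_path, os.R_OK):
        with open(log_path, errors="replace") as handle:
            for entry in find_upgrades(handle.read()):
                print(entry)
                print()
    else:
        print("No readable apt history at", log_path)
```

Comparing the dates of the matching entries against the first appearance of the errors helps decide which package update to investigate first.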
My implementation is spread across several files, so it might be worth looking at the GitHub commit. But as I mentioned, I have changed none of the source files since CHOLMOD GPU acceleration was working over the past couple of months.
Any suggestions are greatly appreciated!
cudaMemGetInfo() was reporting 0 free bytes because my program did not have executable privileges on libcublas and libcudart. As soon as I run my program with "sudo" in front of it, the GPU is used and CHOLMOD works as it did a few weeks ago.
I am unsure if the kernel changed the privileges, or if new privileges are required for certain .so installations. It is a bit of a mystery. But the solution is to use "sudo" to run the program.
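To see which permissions are actually in place, something like the following sketch can list what the current user is allowed to do with the relevant files. The paths under /usr/local/cuda/lib64 are an assumed install location and should be adjusted to wherever libcudart and libcublas live on your system. The /dev/nvidia* device nodes are included as one more place to look; that is my own guess and not something confirmed above. The helper names describe_access and check_paths are hypothetical.

```python
import glob
import os
import stat


def describe_access(path):
    """Report the mode bits of path and what the current user may do with it."""
    info = os.stat(path)
    return {
        "path": path,
        "mode": stat.filemode(info.st_mode),  # e.g. "-rwxr-xr-x"
        "owner_uid": info.st_uid,
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
        "executable": os.access(path, os.X_OK),
    }


def check_paths(patterns):
    """Expand each glob pattern and describe every matching path."""
    reports = []
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            reports.append(describe_access(path))
    return reports


if __name__ == "__main__":
    # Assumed locations; change these to match the local installation.
    patterns = [
        "/usr/local/cuda/lib64/libcudart.so*",
        "/usr/local/cuda/lib64/libcublas.so*",
        "/dev/nvidia*",  # assumption: device nodes may matter as well
    ]
    for report in check_paths(patterns):
        print(report)
```

If the output differs between a normal run and a run under sudo, the entries that change point at the files whose permissions need attention.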