Tags: matrix, octave, nvblas

NVBLAS silently fails for semi-large matrix multiplication


I followed the instructions here to run Octave with NVBLAS. I have CUDA Toolkit 7.5 installed and a Tesla K40c GPU. To start Octave with NVBLAS, I used LD_PRELOAD=libnvblas.so octave. I then ran the following simple code:

N = 256
A = rand(N,N)
B = rand(N,N)
A*B

which produces a matrix with reasonable values. However, if I increase N to 512 or anything larger, I get back all zeros (or very small numbers) as the result.

If I use OpenBLAS this does not happen. The matrices should be small enough that they fit in the card's RAM (12GB). Any idea why this might happen?

Note: If I make A and B identity matrices this does not happen, but it still happens with A = B = ones(N,N).
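
For scale, a quick back-of-the-envelope check of the operand sizes (double precision, 8 bytes per element) shows they are nowhere near the 12 GB limit:

# rough operand sizes: an N x N double-precision matrix needs N*N*8 bytes
for N = [256 512 1024 4096]
  printf("N = %4d : %6.1f MiB per matrix\n", N, N*N*8 / 2^20);
end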


Solution

  • Sorry, the question is somewhat stale, but I tried this on an Amazon AWS EC2 p2.xlarge instance with a K80 GPU, and it seems to have worked.

    I was getting results similar to yours (lots of zeros) when I had the default "NVBLAS_GPU_LIST 0 1" setting in nvblas.conf, which seems to refer to two GPUs, so I changed it to list just one GPU and it worked. The complete file is below:

    #Put here the CPU BLAS fallback Library of your choice
    NVBLAS_CPU_BLAS_LIB libopenblas.so
    
    # Specify which output log file (default is stderr)
    NVBLAS_LOGFILE nvblas.log
    
    # List of GPU devices Id to participate to the computation
    # By default if no GPU are listed, only device 0 will be used
    NVBLAS_GPU_LIST 0
    NVBLAS_AUTOPIN_MEM_ENABLED
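
    NVBLAS looks for nvblas.conf in the current working directory by default; if the file lives somewhere else, the NVBLAS_CONFIG_FILE environment variable can be pointed at its full path before launching Octave (at least, that is my reading of the NVBLAS documentation).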
    

    Here is the program (t1.m), slightly modified from the NVIDIA link to count the number of non-zeros in the output matrix:

    N = 16384;
    
    # from the original NVidia example:
    #A = single(rand(N,N));
    #B = single(rand(N,N));
    
    # double precision seems to work fine (not checked in detail)
    A = rand(N,N);
    B = rand(N,N);
    
    start = clock();
    C = A * B;
    elapsedTime = etime(clock(), start);
    disp(elapsedTime);
    gFlops = 2*N*N*N/(elapsedTime * 1e+9);
    disp(gFlops);
    
    disp("number of elements >0:")
    disp(sum(sum(C > 0)));
    
    disp("Should be:")
    disp(N*N)
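
    A quick extra check, beyond counting non-zeros, is to recompute a few entries of C with an element-wise dot product, which avoids the GEMM call that NVBLAS intercepts. Something like this could be appended to t1.m (it assumes A, B and C are still in the workspace):

    # spot-check a handful of entries against a non-GEMM reference
    max_err = 0;
    for k = 1:5
      i = randi(N);
      j = randi(N);
      ref = sum(A(i,:) .* B(:,j)');   # plain element-wise dot product
      max_err = max(max_err, abs(C(i,j) - ref));
    end
    disp("max spot-check error:")
    disp(max_err)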
    

    FYI, here is the nvidia-smi output while it was running as above (memory usage peaked at 172 MiB with N = 16384):

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
    | N/A   44C    P0    80W / 149W |     80MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0     21080    C   /usr/bin/octave-cli                             78MiB |
    +-----------------------------------------------------------------------------+
    

    Here are the NVIDIA and CUDA packages I'd previously installed:

    cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb  
    libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb
    libcudnn5_5.1.10-1+cuda8.0_amd64.deb                   
    nvidia-driver-local-repo-ubuntu1604_375.51-1_amd64.deb
    

    I seem to get a speed-up of about 8.6x, with about 55 GFLOPS from plain Octave and about 478 GFLOPS from the GPU version.
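
    As a sanity check on those figures: for N = 16384 a matrix multiply takes 2*N^3, i.e. roughly 8.8e12 floating-point operations, so the quoted throughputs correspond to run times of roughly 160 s and 18 s:

    N = 16384;
    flop = 2 * N^3;          # FLOP count for one N x N matrix multiply
    disp(flop / 55e9)        # ~160 s at  55 GFLOPS (plain Octave)
    disp(flop / 478e9)       # ~18 s  at 478 GFLOPS (with NVBLAS)
    disp(478 / 55)           # ratio of the rounded figures, ~8.7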