
dgemm nvblas gpu offload


I have a test application that performs matrix multiplication, and I tried to offload it to the GPU with NVBLAS.

#include <armadillo>
#include <iostream>
using namespace arma;
using namespace std;

// Usage: ./a.out m k n t
// Multiplies an m x k matrix by a k x n matrix t times and reports the
// average wall-clock time per multiplication.
int main(int argc, char *argv[]) {
    int m = atoi(argv[1]);
    int k = atoi(argv[2]);
    int n = atoi(argv[3]);
    int t = atoi(argv[4]);  // number of timing iterations
    std::cout << "m::" << m << "::k::" << k << "::n::" << n << std::endl;
    mat A = randu<mat>(m, k);
    mat B = randu<mat>(k, n);
    mat C;
    C.zeros(m, n);
    cout << "norm c::" << arma::norm(C, "fro") << std::endl;
    tic();
    for (int i = 0; i < t; i++) {
      C = A * B;  // Armadillo dispatches this to BLAS dgemm
    }
    cout << "time taken ::" << toc() / t << endl;
    cout << "norm c::" << arma::norm(C, "fro") << std::endl;
    return 0;
}

I compiled the code as follows.

CPU

g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -o a.cpu.out

GPU

g++ testmm.cpp -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -lnvblas -L$CUDATOOLKIT_HOME/lib64 -o a.cuda.out

When I run a.cpu.out and a.cuda.out with 4096 4096 4096, both of them take about the same time, around 11 seconds; I am not seeing a reduction in time with a.cuda.out. In nvblas.conf, I am leaving everything at the default except (a) changing the path for the OpenBLAS library and (b) enabling auto-pinned memory. The nvblas.log says "Devices 0" and has no other output. nvidia-smi does not show any increase in GPU activity, and nvprof shows a bunch of cudaMalloc, cudaMemcpy, and device-capability queries, but no gemm call is present.
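
For reference, a minimal nvblas.conf with those two changes looks roughly like this (the OpenBLAS path is an example; point it at your own build):

NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB /path/to/openblas/lib/libopenblas.so
NVBLAS_GPU_LIST ALL
NVBLAS_AUTOPIN_MEM_ENABLED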

Running ldd on a.cuda.out shows it is linked against nvblas, cublas, cudart, and the CPU OpenBLAS library. Am I making a mistake here?


Solution

  • The order of the linking was the problem. It got resolved when I did the following for the GPU build (the sketches at the end of this answer show how to verify the result).

    GPU

    g++ testmm.cpp -lnvblas -L$CUDATOOLKIT_HOME/lib64 -I$ARMADILLO_INCLUDE_DIR -lopenblas -L$OPENBLAS_ROOT/lib/ -std=c++11 -o a.cuda.out
    

    With the above, when I dump the symbol table, I see the following output.

    nm a.cuda.out | grep -is dgemm
                 U cblas_dgemm
             U dgemm_@@libnvblas.so.9.1 <-- this shows correct linking and the ability to offload to the GPU.
    

    If it is not linked properly, the output will instead look as follows.

    nm a.cuda.out | grep -is dgemm
                 U cblas_dgemm
             U dgemm_  <-- no libnvblas version tag here, showing the problem.
    

    Even though ldd shows nvblas, cublas, cudart, and openblas in both of the above cases, in the badly linked case dgemm always resolves to OpenBLAS when the program executes.
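
    Why the order matters: both libnvblas and libopenblas export the dgemm_ symbol, and the linker binds the reference to whichever library provides it first on the command line, so nvblas has to come before openblas. As a quick check that libnvblas actually exports the symbol (the path is the one from my build line; adjust for your setup):

    nm -D $CUDATOOLKIT_HOME/lib64/libnvblas.so | grep -i dgemm_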
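
    Alternatively, NVBLAS can be interposed at run time with LD_PRELOAD, which sidesteps the link order entirely. A sketch, assuming the same paths as above and a config file at /path/to/nvblas.conf:

    NVBLAS_CONFIG_FILE=/path/to/nvblas.conf \
    LD_PRELOAD=$CUDATOOLKIT_HOME/lib64/libnvblas.so \
    ./a.cpu.out 4096 4096 4096 3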