Solving Ax = b: CUDA vs MATLAB

I ran the following test in MATLAB:

n = 10000;
A = rand(n,n);
b = rand(n, 1);

tic
y = A\b;
toc

On my 5th-generation Intel i7 machine (12 cores), this takes about 5 seconds.

Then I tried to do the same using the CUDA 9.2 SDK sample code (see cuSolverDn_LinearSolver.cpp). Surprisingly, on my Nvidia GTX 970 it takes about 6.5 seconds to solve a problem of the same size!

What is wrong? Note that my matrix is symmetric and square, and b has only one column. Is there a better way to solve this problem using CUDA? Should I expect better performance from a newer GPU?

Solution

Here is the code I used to test this:

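n = 10000;
A = rand(n,n,'single');   % single-precision test matrix
b = rand(n, 1,'single');  % single-precision right-hand side

% Time the solve on the CPU
tic
y = A\b;
toc

% Move the data to the GPU
A = gpuArray(A);
b = gpuArray(b);

% Time the solve on the GPU. gpuArray operations can run asynchronously,
% so calling wait(gpuDevice) before toc (or using gputimeit) gives a more
% reliable timing.
tic
y = A\b;
toc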

Here are the results (CPU first, then GPU):

Elapsed time is 2.673490 seconds.
Elapsed time is 0.553348 seconds.

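I am running on a 7700 4-core laptop with a GTX 1060 GPU, so I think the computing power is roughly comparable to yours. As you can see, in this case the GPU runs faster. The most likely factor is precision. MATLAB's rand returns double precision by default, so your original test ran in double. Consumer GPUs such as the GTX 970 and GTX 1060 are built mainly for single precision: they have far fewer double-precision units than single-precision ones, so their double-precision throughput is only a small fraction (1/32 on these cards) of their single-precision throughput. A CPU, by contrast, loses only about a factor of two when going from single to double. Doing double-precision arithmetic on such a GPU therefore slows it down drastically. If I change the variables to double precision, we now get: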

Elapsed time is 5.784525 seconds.
Elapsed time is 5.398702 seconds.

The GPU is still slightly faster on my computer, but the point stands: in double precision the CPU and GPU are now much closer together.
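
To see why precision matters so much, here is a back-of-the-envelope estimate in Python (not part of the original test). It assumes the solve is dominated by an LU factorization costing about (2/3)·n^3 floating-point operations. It also assumes a GeForce-class card with a peak single-precision rate of around 3500 GFLOP/s and a double-precision rate of 1/32 of that. Both throughput figures are rough assumptions for illustration, not measurements, and real solvers run below peak.

```python
# Rough estimate of the time to LU-factorize a dense n-by-n matrix at a
# given sustained floating-point rate. All throughput numbers below are
# assumptions for illustration, not measured values.

def lu_flops(n):
    """Leading-order flop count of an LU factorization: (2/3) * n^3."""
    return 2.0 * n**3 / 3.0

def solve_time_seconds(n, gflops):
    """Best-case time in seconds at a sustained rate of `gflops` GFLOP/s."""
    return lu_flops(n) / (gflops * 1e9)

n = 10_000

# Assumption: order-of-magnitude peak FP32 rate of a GTX 970-class card.
fp32_gflops = 3500.0
# Assumption: FP64 runs at 1/32 of the FP32 rate on this class of card.
fp64_gflops = fp32_gflops / 32.0

t32 = solve_time_seconds(n, fp32_gflops)
t64 = solve_time_seconds(n, fp64_gflops)

print(f"single: {t32:.2f} s, double: {t64:.2f} s, ratio: {t64 / t32:.0f}x")
```

Under these assumptions the double-precision estimate comes out at several seconds, the same order of magnitude as the timings reported above, while the single-precision estimate is well under a second. Treat this as a sanity check on the explanation, not as a performance model.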