I'm trying to do a simple matrix-vector product with OpenCL, using the ViennaCL library.
Here's my main:
#include "viennacl/scalar.hpp"
#include "viennacl/vector.hpp"
#include "viennacl/matrix.hpp"
#include "viennacl/linalg/prod.hpp"
#include "viennacl/matrix_proxy.hpp"
#include "viennacl/linalg/lu.hpp"
int main()
{
viennacl::ocl::set_context_device_type(0, viennacl::ocl::gpu_tag());
std::vector<viennacl::ocl::device> devices = viennacl::ocl::current_context().devices();
viennacl::ocl::current_context().switch_device(devices[0]);
int Nx=10;
int Ny=10;
//GPU vectors
viennacl::matrix<float> vcl_A(Nx,Ny);
viennacl::vector<float> vcl_b(Ny);
viennacl::vector<float> vcl_c(Nx);
//CPU vectors
std::vector<float> stl_A(Nx*Ny);
std::vector<float> stl_b(Ny);
std::vector<float> stl_c(Nx);
//filling CPU vectors
for (unsigned int i = 0; i < Nx; ++i)
for (unsigned int j = 0; j < Ny; ++j)
stl_A[i*Ny + j] = (float) (rand()%100);
for (unsigned int i = 0; i < stl_b.size(); ++i)
stl_b[i] = (float) (rand()%100);
//copying input data to GPU
viennacl::fast_copy(&(stl_A[0]),
&(stl_A[0]) + stl_A.size(),
vcl_A);
viennacl::fast_copy(stl_b, vcl_b);
//launching product c = A*b
vcl_c = viennacl::linalg::prod(vcl_A, vcl_b);
//copying output data back to CPU
viennacl::copy(vcl_c, stl_c);
viennacl::backend::finish();
}
Afterwards, my stl_c vector has its first coefficient correctly computed, but every 9 other coefs are 0
.
When I change the dimensions to upper values, I get more than one right coef at the begining of my vector, but I still get a bunch of zeros for every other coefs.
I'm guessing some of my copies are done the wrong way, but maybe my prod operation is in cause (local/globalsize issue, but I assume ViennaCL takes care of it all)
Any Idea of what i'm doing wrong? Any help or advice will be really appreciated.
(I'm running the code on VS 2012, my GPU is an NVIDIA Geforce gtx 670)
1. The problem:
The documentation for viennacl::matrix
in the page manual-types-matrix
states:
The internal memory buffer of a
matrix<>
is by default padded with zeros so that the internal matrix size is a multiple of e.g. a power of two. When usingfast_copy()
on a matrix, the padded zeros need to be taken into account correctly. Queryinternal_size1()
andinternal_size2()
to do so.
This means that the elements of the viennacl::matrix
are not contiguous, contrarily to the ones in the std::vector
you use to simulate a matrix. Therefore, this line does not do what you expect:
viennacl::fast_copy(&(stl_A[0]), &(stl_A[0]) + stl_A.size(), vcl_A);
2. The solution:
So, how to properly copy a host matrix to a ViennaCL matrix?
A possibility is to use a std::vector<std::vector<float>>
to represent the host matrix and then use viennacl::copy
instead of vienna::fast_copy
, and the padding of elements will be taken care of for you.
std::vector<std::vector<float>> stl_A(Ny);
for (unsigned int i = 0; i < Ny; ++i) {
stl_A[i].resize(Nx);
for (unsigned int j = 0; j < Nx; ++j)
stl_A[i][j] = (float)(rand() % 100);
}
viennacl::copy(stl_A, vcl_A);
Another possibility, as suggested in the documentation, is to match the internal layout of a viennacl::matrix
in your host matrix, by using internal_size
instead of Nx
and Ny
when calculating element offsets (but not iterating over them).
std::vector<float> stl_A(vcl_A.internal_size());
for (unsigned int i = 0; i < Ny; ++i)
for (unsigned int j = 0; j < Nx; ++j)
stl_A[i*vcl_A.internal_size2() + j] = (float)(rand() % 100);
viennacl::fast_copy(&(stl_A[0]), &(stl_A[0]) + stl_A.size(), vcl_A);
3. The note:
Both code examples provided above are for row-major matrices. For column-major matrices, swap the loops and use internal_size1()
instead.