I am interested in knowing how the cublasSgemm/clAmdBlasSgemm routines are mapped onto the GPU when calculating a matrix multiplication (C = A * B).
Assume the dimensions of the input matrices are A_rows = 6144, A_cols = 12288, B_rows = 12288, B_cols = 15360, and the dimensions of the resultant matrix are C_rows = 6144, C_cols = 15360.
Assume I have initialized the input matrices on the host and copied the matrix data into device memory. After that I call the following cuBLAS or clAmdBlas routine to do the matrix multiplication on the GPU:
void cublasSgemm(char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda, const float *B, int ldb, float beta, float *C, int ldc);
where m = A_rows, n = B_cols, and k = A_cols (= B_rows).
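For concreteness, here is a minimal sketch of such a call using the legacy cuBLAS host API (error checking omitted; cuBLAS assumes column-major storage, so the leading dimensions are the row counts of each matrix):

    #include <cublas.h>   /* legacy cuBLAS host API */

    int main(void)
    {
        const int m = 6144;   /* A_rows = C_rows */
        const int k = 12288;  /* A_cols = B_rows */
        const int n = 15360;  /* B_cols = C_cols */
        float *dA, *dB, *dC;

        cublasInit();
        cublasAlloc(m * k, sizeof(float), (void **)&dA);
        cublasAlloc(k * n, sizeof(float), (void **)&dB);
        cublasAlloc(m * n, sizeof(float), (void **)&dC);

        /* ... copy host data to dA and dB, e.g. with cublasSetMatrix ... */

        /* C = 1.0 * A * B + 0.0 * C, no transposition; column-major
           storage means lda = m, ldb = k, ldc = m */
        cublasSgemm('N', 'N', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);

        /* ... copy dC back to the host ... */

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
        return 0;
    }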
So my questions are:
1) How are these routines implemented on the GPU?
2) Are the m and n values mapped onto one compute unit (SM)? If not, what is the maximum value for m and n?
3) Do we have any control over the threads/blocks?
For the host side CUBLAS API (note that I have no idea why you would assume that clAmdBlasSgemm would be the same), the short answers to your questions are as follows:
1) Modern CUBLAS is closed source, so the exact implementation is not documented. Broadly speaking, the gemm routines decompose the output matrix into tiles, assign each tile to a thread block, and stage sub-blocks of A and B through shared memory (the early CUBLAS gemm kernels grew out of Vasily Volkov's work, if you want a published description of the approach).
2) No. The computation is split across many thread blocks, which the hardware schedules over all of the available SMs. The practical upper bound on m and n is set by available device memory, not by the capacity of a single SM.
3) No. The library chooses its own grid and block dimensions internally; the host API exposes no control over the launch configuration.
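To illustrate point 2, here is a simplified tiled kernel of the kind such libraries build on. This is not cuBLAS's actual code (which, again, is closed source), just a sketch showing how m and n map onto a 2-D grid of thread blocks; it assumes column-major storage and dimensions that are exact multiples of the tile size:

    #define TILE 16

    __global__ void sgemm_tiled(int m, int n, int k,
                                const float *A, const float *B, float *C)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.x * TILE + threadIdx.x;   /* position along m */
        int col = blockIdx.y * TILE + threadIdx.y;   /* position along n */
        float acc = 0.0f;

        for (int t = 0; t < k; t += TILE) {
            /* stage one TILE x TILE slice of A and B in shared memory */
            As[threadIdx.y][threadIdx.x] = A[(t + threadIdx.y) * m + row];
            Bs[threadIdx.y][threadIdx.x] = B[col * k + (t + threadIdx.x)];
            __syncthreads();

            for (int i = 0; i < TILE; ++i)
                acc += As[i][threadIdx.x] * Bs[threadIdx.y][i];
            __syncthreads();
        }
        C[col * m + row] = acc;   /* column-major store */
    }

    /* one thread block per TILE x TILE tile of C: the grid covers the
       whole m x n output, so the work is spread across all of the SMs
       the hardware has, not confined to one */
    void launch_sgemm(int m, int n, int k,
                      const float *dA, const float *dB, float *dC)
    {
        dim3 block(TILE, TILE);
        dim3 grid(m / TILE, n / TILE);
        sgemm_tiled<<<grid, block>>>(m, n, k, dA, dB, dC);
    }

With the dimensions in your question, that is a 384 x 960 grid of blocks, far more than any single SM could ever hold at once.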
Note that there is also a CUBLAS device API for K20 Kepler devices, and the answers I provided above do not apply to that library.