I am interested in knowing how the cublasSgemm/clAmdBlasSgemm routines are mapped onto the GPU when calculating a matrix multiplication (C = A * B).
Assume the dimensions of the input matrices are A_rows = 6144, A_cols = 12288, B_rows = 12288, B_cols = 15360, and the dimensions of the resultant matrix are C_rows = 6144, C_cols = 15360.
Assume I have initialized the input matrices on the host and copied the matrix data into device memory. After that I call the following cuBLAS or clAmdBlas routine to do the matrix multiplication on the GPU:
void cublasSgemm(char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda, const float *B, int ldb, float beta, float *C, int ldc);
where m = A_rows, n = B_cols, and k = A_cols (= B_rows).
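For concreteness, here is a minimal sketch of such a call using the legacy cuBLAS host API (error checking omitted; cuBLAS assumes column-major storage, so the leading dimensions are the row counts of each matrix):

    #include <cublas.h>   /* legacy cuBLAS host API */

    int main(void)
    {
        const int m = 6144;   /* A_rows = C_rows */
        const int k = 12288;  /* A_cols = B_rows */
        const int n = 15360;  /* B_cols = C_cols */
        float *dA, *dB, *dC;

        cublasInit();
        cublasAlloc(m * k, sizeof(float), (void **)&dA);
        cublasAlloc(k * n, sizeof(float), (void **)&dB);
        cublasAlloc(m * n, sizeof(float), (void **)&dC);

        /* ... copy host data to dA and dB, e.g. with cublasSetMatrix ... */

        /* C = 1.0 * A * B + 0.0 * C, no transposition; column-major
           storage means lda = m, ldb = k, ldc = m */
        cublasSgemm('N', 'N', m, n, k, 1.0f, dA, m, dB, k, 0.0f, dC, m);

        /* ... copy dC back to the host ... */

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
        return 0;
    }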
So my questions are:
1) How are these routines implemented on the GPU?
2) Are the m and n values mapped onto one compute unit (SM)? If not, what is the maximum value for m and n?
3) Do we have any control over the threads/blocks?
For the host side CUBLAS API (note that I have no idea why you would assume that clAmdBlasSgemm would be the same), the short answers to your questions are as follows:
1) Modern CUBLAS is closed source, so the exact implementation is not documented. Broadly speaking, the gemm routines decompose the output matrix into tiles, assign each tile to a thread block, and stage sub-blocks of A and B through shared memory (the early CUBLAS gemm kernels grew out of Vasily Volkov's work, if you want a published description of the approach).
2) No. The computation is split across many thread blocks, which the hardware schedules over all of the available SMs. The practical upper bound on m and n is set by available device memory, not by the capacity of a single SM.
3) No. The library chooses its own grid and block dimensions internally; the host API exposes no control over the launch configuration.
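To illustrate point 2, here is a simplified tiled kernel of the kind such libraries build on. This is not cuBLAS's actual code (which, again, is closed source), just a sketch showing how m and n map onto a 2-D grid of thread blocks; it assumes column-major storage and dimensions that are exact multiples of the tile size:

    #define TILE 16

    __global__ void sgemm_tiled(int m, int n, int k,
                                const float *A, const float *B, float *C)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.x * TILE + threadIdx.x;   /* position along m */
        int col = blockIdx.y * TILE + threadIdx.y;   /* position along n */
        float acc = 0.0f;

        for (int t = 0; t < k; t += TILE) {
            /* stage one TILE x TILE slice of A and B in shared memory */
            As[threadIdx.y][threadIdx.x] = A[(t + threadIdx.y) * m + row];
            Bs[threadIdx.y][threadIdx.x] = B[col * k + (t + threadIdx.x)];
            __syncthreads();

            for (int i = 0; i < TILE; ++i)
                acc += As[i][threadIdx.x] * Bs[threadIdx.y][i];
            __syncthreads();
        }
        C[col * m + row] = acc;   /* column-major store */
    }

    /* one thread block per TILE x TILE tile of C: the grid covers the
       whole m x n output, so the work is spread across all of the SMs
       the hardware has, not confined to one */
    void launch_sgemm(int m, int n, int k,
                      const float *dA, const float *dB, float *dC)
    {
        dim3 block(TILE, TILE);
        dim3 grid(m / TILE, n / TILE);
        sgemm_tiled<<<grid, block>>>(m, n, k, dA, dB, dC);
    }

With the dimensions in your question, that is a 384 x 960 grid of blocks, far more than any single SM could ever hold at once.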
Note that there is also a CUBLAS device API for K20 Kepler devices, and the answers I provided above do not apply to that library.