cublasSgemm row-major multiplication

I'm trying to use cublasSgemm to multiply two non-square matrices that are stored in row-major order. I know that this function has one parameter where you can specify that if you want to transpose the matrices (CUBLAS_OP_T) but the result is stored in column-major order and I need it in row-major too.

Also,the code I have is not capable of multiplying non-square matrices with the parameter CUBLAS_OP_T. Only square or non-square with CUBLAS_OP_N.

Besides, I know that there are the option to declare the matrices in column-order with

define IDX2C(i,j,ld) (((j)*(ld))+(i))

but this isn't an option because the matrices that I have to use will be set in another program.

I suppose that there is a lot of information on the internet, but I'm not able to find any answer to my question.

my code is the following:


    int m = 2;
    int k = 3;
    int n = 4;
    int print = 1;
    cudaError_t cudaStat; // cudaMalloc status
    cublasStatus_t stat; // CUBLAS functions status
    cublasHandle_t handle; // CUBLAS context

    int i,j;

    float *a, *b,*c;
    
    //malloc for a,b,c...

    // define a mxk matrix a row by row
    int ind =11;                               
    for(j=0;j<m*k;j++){                     
        a[j]=(float)ind++;                                    
    }                                        
                                             
    // define a kxn matrix b column by column
    ind =11;                               
    for(j=0;j<k*n;j++){                     
        b[j]=(float)ind++;                                    
    }                                        


    // DEVICE
    float *d_a, *d_b, *d_c; 
 
    //cudaMalloc for d_a, d_b, d_c...

    stat = cublasCreate(&handle); // initialize CUBLAS context

    stat = cublasSetMatrix(m,k, sizeof(*a), a, m, d_a, m);
    stat = cublasSetMatrix(k,n, sizeof(*b), b, k, d_b, k); 

    float al =1.0f;
    float bet =0.0f; 

    stat=cublasSgemm(handle,CUBLAS_OP_T,CUBLAS_OP_T,m,n,k,&al,d_a,m,d_b,k,&bet,d_c,m);

    stat = cublasGetMatrix (m,n, sizeof (*c) ,d_c ,m,c,m); // cp d_c - >c

    if(print == 1) {
    printf ("\nc after Sgemm :\n");
        for(i=0;i<m*n;i++){
                printf ("%f ",c[i]); // print c after Sgemm
        }
    }

    cudaFree (d_a); 
    cudaFree (d_b); 
    cudaFree (d_c);
    cublasDestroy (handle); // destroy CUBLAS context
    free (a); 
    free (b); 
    free (c); 

    return EXIT_SUCCESS ;
}

The output is the transpose of multiplying A * B, that is: (A * B)T.

But what I want is C = A * B in row-major order.

I hope someone can help me.

Solution

As you said, cuBLAS interprets matrices as column-major ordered, so when you execute cublasSgemm(handle,CUBLAS_OP_T,CUBLAS_OP_T,m,n,k,&al,d_a,m,d_b,k,&bet,d_c,m), you are correctly transposing each input (which was created in row-major form) in preparation for the column-major interpretation. The problem is that cuBLAS also dumps the result in column-major order.

We will trick cuBLAS into computing $(AB)^T$ , which will be outputted in column major order and will thus look like $AB$ when we slyly interpret it in row-major order. So instead of computing AB = C, we do $B^T A^T$ = $C^T$ . Luckily, $B^T$ and $A^T$ we already obtained by the very action of creating A and B in row-major order, so we can simply bypass the transposition with CUBLAS_OP_N. So change the line to cublasSgemm(handle,CUBLAS_OP_N,CUBLAS_OP_N,n,m,k,&al,d_b,n,d_a,k,&bet,d_c,n).

The example code you supplied should calculate

$\begin{bmatrix} 11 & 12 & 13\\ 14 & 15 & 16 \end{bmatrix} \times \begin{bmatrix} 11 & 12 & 13 & 14\\ 15 & 16 & 17 & 18\\ 19 & 20 & 21 & 22 \end{bmatrix} = \begin{bmatrix} 548 & 584 & 620 & 656\\ 683 & 728 & 773 & 818 \end{bmatrix}$

and after running with the updated call to cublasSgemm, we get:

c after Sgemm :
548.000000 584.000000 620.000000 656.000000 683.000000 728.000000 773.000000 818.000000