matrix cuda matrix-multiplication cublas

CUDA cublasSgemm matrix multiplication in specific format

I would like to multiply two matrices A and B using CUDA's cublasSgemm function.

However, A is in row-major format and B is in column-major format and I want the output to be in row-major format.

Is that possible - and if so - how do I have to set these parameters

cublasHandle_t handle,
cublasOperation_t transa, cublasOperation_t transb,
int m, int n, int k,
const float     *alpha,
const float     *A, int lda,
const float     *B, int ldb,
const float     *beta,
float           *C, int ldc

if e.g. A=1x4 and B=4x16 such that C=1x16?

Solution

Your example is really a vector-matrix product (A is a single row), which hides the layout issue, so I'll pick my own example to make it clearer:

C = A * B

A = [8, 4] row major, meaning [4, 8] when interpreted column-major
B = [4, 6] column major
C = [8, 6] row major, meaning [6, 8] when interpreted column-major

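cuBLAS uses column-major ordering, so talking about row-major order is distracting. It is simpler to say that a row-major matrix, read as column-major, is its own transpose. Viewed column-major, the buffers are A = [4, 8], B = [4, 6], C = [6, 8]. Because transpose(A * B) = transpose(B) * transpose(A), and the column-major view of the A buffer already holds transpose(A), what you want in terms of these buffers is C = transpose(B) * A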

In cuBLAS, that would be

const float alpha = 1.f, beta = 0.f;
cublasSgemm(handle,
            CUBLAS_OP_T /* transpose left side (B) */,
            CUBLAS_OP_N /* no transposition of right side (A) */,
            6 /* m: rows in column-major C */, 8 /* n: columns in column-major C */,
            4 /* k: rows on right side */,
            &alpha, B /* left side */, ld_B,
            A /* right side */, ld_A, &beta,
            C, ld_C);
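To sanity-check this argument mapping without a GPU, here is a minimal CPU sketch. ref_sgemm is a hypothetical stand-in that follows the column-major gemm convention (it is not part of cuBLAS), and mixed_layout_matches is an illustrative helper that assumes tightly packed buffers, i.e. ld_A = ld_B = K and ld_C = N for an M x K row-major A, a K x N column-major B, and an M x N row-major C.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU stand-in for a column-major gemm:
// C = alpha * op(A) * op(B) + beta * C, where op(X) is X or transpose(X).
void ref_sgemm(bool transa, bool transb, int m, int n, int k,
               float alpha, const float* A, int lda,
               const float* B, int ldb,
               float beta, float* C, int ldc) {
    assert(ldc >= m);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float acc = 0.f;
            for (int p = 0; p < k; ++p) {
                // op(A)(row, p) and op(B)(p, col), both read from column-major storage
                const float a = transa ? A[p + row * lda] : A[row + p * lda];
                const float b = transb ? B[col + p * ldb] : B[p + col * ldb];
                acc += a * b;
            }
            float& c = C[row + col * ldc];
            c = (beta == 0.f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}

// Row-major A (M x K) times column-major B (K x N) into row-major C (M x N),
// compared against a naive triple loop. Assumes tightly packed buffers.
bool mixed_layout_matches(int M, int K, int N) {
    std::vector<float> A(M * K), B(K * N), C(M * N), expected(M * N);
    for (int i = 0; i < M; ++i)
        for (int p = 0; p < K; ++p)
            A[i * K + p] = static_cast<float>(i + 2 * p + 1);  // row-major
    for (int j = 0; j < N; ++j)
        for (int p = 0; p < K; ++p)
            B[p + j * K] = static_cast<float>(3 * p - j);      // column-major

    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.f;
            for (int p = 0; p < K; ++p)
                acc += A[i * K + p] * B[p + j * K];
            expected[i * N + j] = acc;                         // row-major
        }

    // Same argument order as the call above: B on the left (transposed),
    // A on the right, m = N, n = M, k = K.
    ref_sgemm(true, false, N, M, K,
              1.f, B.data(), K, A.data(), K,
              0.f, C.data(), N);
    return C == expected;
}
```

Calling mixed_layout_matches(8, 4, 6) corresponds to the example above. Under the same tight-packing assumption, the case from the question (A = 1x4, B = 4x16) maps to m = 16, n = 1, k = 4, with leading dimensions 4 for B, 4 for A, and 16 for C.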

ld_A, ld_B, and ld_C are the leading dimensions of the A, B, and C buffers. Also known as the outer stride, the leading dimension is the number of elements between the starts of successive columns in memory (since the matrices are column-major). Due to padding, or when using slices of larger matrices, it can be larger than the "logical" number of rows, but never smaller. For tightly packed buffers in this example, ld_A = 4, ld_B = 4, and ld_C = 6.

In other words, if we have float* A and A[0] is the first row, first column, then A[1] is the second row, first column, A[ld_A] is the first row, second column, and A[ld_A + 1] is the second row, second column.
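The indexing rule can be sketched on the host as follows. cm_offset and make_padded are hypothetical helper names used only for illustration; the sketch fills a rows x cols column-major matrix whose leading dimension is larger than its row count and marks the padding entries with -1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: linear offset of element (row, col) in a
// column-major matrix with leading dimension ld.
inline std::size_t cm_offset(int row, int col, int ld) {
    return static_cast<std::size_t>(row) + static_cast<std::size_t>(col) * ld;
}

// Build a rows x cols column-major matrix stored with leading dimension ld.
// Logical element (row, col) holds 10 * row + col; padding entries hold -1.
std::vector<float> make_padded(int rows, int cols, int ld) {
    assert(ld >= rows);  // the leading dimension may exceed the row count, never undercut it
    std::vector<float> buf(static_cast<std::size_t>(ld) * cols, -1.f);
    for (int col = 0; col < cols; ++col)
        for (int row = 0; row < rows; ++row)
            buf[cm_offset(row, col, ld)] = static_cast<float>(10 * row + col);
    return buf;
}
```

With make_padded(3, 2, 5), the buffer has 10 elements: each column occupies 5 slots, of which only the first 3 are logical rows. Passing 5 as the leading dimension lets a gemm routine skip the two padding slots per column.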

If you use cudaMalloc3D to allocate suitably padded matrices, the leading dimension is pitch / sizeof(float), where pitch is the member of the returned cudaPitchedPtr.