I've made the following CUDA tests to compare the performance numbers of (square) matrix multiplication, running on Ubuntu 24.04 with the GPU card Quadro T1000 Mobile of compute capability 7.5 (arch=SM_75) and the GPU driver nvidia-driver-535 .
Remark
Configuration
Test Lib Cores
---- --- ----
mat_mul_custom cuda_runtime v12.0.1 cuda-cores
mat_mul_blas cublas_v2 v12.0.1 cuda-cores
mat_mul_tensor cutensor v2.0.2.4 dedicated-cuda-cores (non-tensor-cores)
n/a cutensor tensor-cores
Mat_mul_test.cu
#include <cuda_runtime.h>
#define STRIDE 1024
#define SUB_STRIDE 32
__global__ void mat_mul_custom_d(float*, float*, float*);
__global__ void mat_mul_custom_d(float* A, float* B, float* C) {
int by = blockIdx.y, bx = blockIdx.x, ty = threadIdx.y, tx = threadIdx.x;
int aBegin = by * SUB_STRIDE * STRIDE + ty * STRIDE + tx,
aEnd = aBegin + STRIDE, bBegin = SUB_STRIDE * bx + ty * STRIDE + tx,
bStep = SUB_STRIDE * STRIDE;
float sC = 0;
for (int a = aBegin, b = bBegin; a < aEnd; a += SUB_STRIDE, b += bStep) {
__shared__ float As[SUB_STRIDE][SUB_STRIDE], Bs[SUB_STRIDE][SUB_STRIDE];
As[ty][tx] = A[a];
Bs[ty][tx] = B[b];
__syncthreads();
for (int k = 0; k < SUB_STRIDE; ++k) sC += As[ty][k] * Bs[k][tx];
__syncthreads();
}
C[by * SUB_STRIDE * STRIDE + SUB_STRIDE * bx + ty * STRIDE + tx] = sC;
}
void mat_mul_custom(float*, float*, float*);
void mat_mul_custom(float* A, float* B, float* C) {
dim3 block(SUB_STRIDE, SUB_STRIDE);
dim3 grid(STRIDE / SUB_STRIDE, STRIDE / SUB_STRIDE);
mat_mul_custom_d<<<grid, block>>>(A, B, C);
}
#include <cublas_v2.h>
#define ALPHA 1
#define BETA 0
void mat_mul_blas(float*, float*, float*);
void mat_mul_blas(float* A, float* B, float* C) {
const float alpha = ALPHA, beta = BETA;
cublasHandle_t cublasH;
cublasCreate(&cublasH);
cublasSgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_N, STRIDE, STRIDE, STRIDE,
&alpha, A, STRIDE, B, STRIDE, &beta, C, STRIDE);
}
#include <cutensor.h>
#define N_MODES 2
#define K_ALIGNMENT 128
void mat_mul_tensor(float*, float*, float*);
void mat_mul_tensor(float* A, float* B, float* C) {
const float alpha = ALPHA, beta = BETA;
const int modeC[] = {'i', 'j'}, modeA[] = {'i', 'k'}, modeB[] = {'k', 'j'};
const int64_t extentC[] = {STRIDE, STRIDE}, extentA[] = {STRIDE, STRIDE},
extentB[] = {STRIDE, STRIDE};
cutensorHandle_t handle;
cutensorCreate(&handle);
cutensorTensorDescriptor_t descA, descB, descC;
cutensorCreateTensorDescriptor(handle, &descA, N_MODES, extentA, 0,
CUTENSOR_R_32F, K_ALIGNMENT);
cutensorCreateTensorDescriptor(handle, &descB, N_MODES, extentB, 0,
CUTENSOR_R_32F, K_ALIGNMENT);
cutensorCreateTensorDescriptor(handle, &descC, N_MODES, extentC, 0,
CUTENSOR_R_32F, K_ALIGNMENT);
cutensorOperationDescriptor_t desc;
cutensorCreateContraction(handle, &desc, descA, modeA, CUTENSOR_OP_IDENTITY,
descB, modeB, CUTENSOR_OP_IDENTITY, descC, modeC,
CUTENSOR_OP_IDENTITY, descC, modeC,
CUTENSOR_COMPUTE_DESC_32F);
cutensorPlanPreference_t planPref;
cutensorCreatePlanPreference(handle, &planPref, CUTENSOR_ALGO_DEFAULT,
CUTENSOR_JIT_MODE_NONE);
cutensorPlan_t plan;
cutensorCreatePlan(handle, &plan, desc, planPref, 0);
cutensorContract(handle, plan, (void*)&alpha, A, B, (void*)&beta, C, C, 0,
0, 0);
}
#include <assert.h>
#include <stdlib.h>
#define RAND_UPPER 2
void mat_mul(void (*mat_mul_impl)(float*, float*, float*), bool);
void mat_mul(void (*mat_mul_impl)(float*, float*, float*), bool column_major) {
float *A, *B, *C;
cudaMallocManaged(&A, sizeof(float) * STRIDE * STRIDE);
cudaMallocManaged(&B, sizeof(float) * STRIDE * STRIDE);
cudaMallocManaged(&C, sizeof(float) * STRIDE * STRIDE);
for (int i = 0; i < STRIDE; i++)
for (int k = 0; k < STRIDE; k++)
A[i * STRIDE + k] = rand() % RAND_UPPER;
for (int j = 0; j < STRIDE; j++)
for (int k = 0; k < STRIDE; k++)
B[k * STRIDE + j] = rand() % RAND_UPPER;
if (column_major)
mat_mul_impl(B, A, C);
else
mat_mul_impl(A, B, C);
cudaDeviceSynchronize();
for (int i = 0; i < STRIDE; i++)
for (int j = 0; j < STRIDE; j++) {
float res = 0;
for (int k = 0; k < STRIDE; k++)
res += A[i * STRIDE + k] * B[k * STRIDE + j];
assert(res == C[i * STRIDE + j]);
}
cudaFree(A);
cudaFree(B);
cudaFree(C);
}
#include <string.h>
#include <stdio.h>
int main(int argc, char** argv) {
if (argc == 2) {
if (!strcmp(argv[1], "custom")) {
printf("Test: mat_mul_custom\n");
mat_mul(mat_mul_custom, false);
return 0;
}
if (!strcmp(argv[1], "blas")) {
printf("Test: mat_mul_blas\n");
mat_mul(mat_mul_blas, true);
return 0;
}
if (!strcmp(argv[1], "tensor")) {
printf("Test: mat_mul_tensor\n");
mat_mul(mat_mul_tensor, true);
return 0;
}
}
printf("Usage: nvprof ./Mat_mul_test [custom|blas|tensor]\n");
}
Tests
$ nvcc ./Mat_mul_test.cu \
-lcublas \
-L${CUTENSOR_ROOT}/lib/12 \
-lcutensor \
-I${CUTENSOR_ROOT}/include \
-o ./Mat_mul_test
$ nvprof ./Mat_mul_test custom
Avg Name
--- ----
15.352ms mat_mul_custom_d
Count Total Time Name
----- ---------- ----
35 4.641ms Gpu page fault groups
$ nvprof ./Mat_mul_test blas
4.628ms volta_sgemm_128x64_nn
28 3.627ms Gpu page fault groups
$ nvprof ./Mat_mul_test tensor
4.879ms volta_sgemm_128x64_nn
31 4.441ms Gpu page fault groups
The GPU under test does not have tensor cores in hardware, but has the hardware to process tensor instructions. The tests between mat_mul_blas and mat_mul_tensor exhibit almost the same performance numbers. Are they as expected? What does the kernel volta_sgemm_128x64_nn execute exactly? The test mat_mul_blas is much faster than mat_mul_custom. Is it just because the kernel volta_sgemm_128x64_nn is highly optimized, or mostly because they execute on different types of CUDA cores?
Reference
Are they as expected?
Yes, its expected that cublas will do matrix multiplication operations faster than the kernel code you have shown. It's also evident that the cutensor designers are using roughly the same approach as cublas - in fact its evident they are calling cublas in a similar fashion to your cublas call. (the durations are basically the same, and the kernel name is the same)
What does the kernel volta_sgemm_128x64_nn execute exactly?
It's a highly-optimized gemm kernel (matrix-matrix multiply) for FP32 data (which is why you see sgemm
), written by the cublas designers, as part of the closed source cublas library. Therefore I cannot provide a detailed description. However writing a highly optimized gemm kernel is a non-trivial undertaking. This writeup may give you some idea of the complexity involved, compared to your kernel code, which we now might refer to as a "naive" shared-memory matrix multiply code.
Is it just because the kernel volta_sgemm_128x64_nn is highly optimized, or mostly because they execute on different types of CUDA cores?
It is because it is highly optimized. They are not using different cores (functional units in the SM). You can get some evidence/proof of this by using the CUDA binary utilities. If you dump the SASS code in each case, you will find both your kernel and the cublas sgemm kernel are using FFMA SASS instructions to perform the matrix-multiply arithmetic.
As has already been mentioned, there are currently no CUDA GPUs that provide tensorcore calculation paths for FP32 input data, so we can completely dispense with that notion at this time; that is not what is happening here, and the fact that both your cublas call and your cutensor call are invoking a cublas non-tensorcore sgemm routine is further evidence of that.