
Reducing matrix rows or columns in CUDA


I'm using CUDA with cuBLAS to perform matrix operations.

I need to sum the rows (or columns) of a matrix. Currently I'm doing it by multiplying the matrix with a ones vector but this doesn't seem so efficient.

Is there any better way? Couldn't find anything in cuBLAS.


Solution

  • Actually, multiplying the matrix by a ones vector using cublas_gemv() is a very efficient approach, unless you are considering writing your own kernel by hand.

    You can easily profile the memory bandwidth achieved by cublas_gemv(). It is very close to that of simply reading the whole matrix data once, which can be seen as the theoretical peak performance of matrix row/column summation.

    The extra multiply-by-1.0 won't cause much of a performance reduction, because:

    1. cublas_gemv() is basically a memory-bandwidth-bound operation, so extra arithmetic instructions won't be the bottleneck;
    2. the FMA (fused multiply-add) instruction folds the multiply into the add, so the extra multiply costs no additional instructions;
    3. the ones vector is usually much smaller than the matrix and is easily cached by the GPU, so it consumes little extra memory bandwidth.

    cublas_gemv() also helps you deal with the matrix layout problem: it works with row- or column-major storage and arbitrary padding (leading dimension).

    I also asked a similar question about this. My experiment shows that cublas_gemv() is better than a segmented reduction using thrust::reduce_by_key, which is another approach to matrix row summation.