
Reducing matrix rows or columns in CUDA


I'm using CUDA with cuBLAS to perform matrix operations.

I need to sum the rows (or columns) of a matrix. Currently I'm doing it by multiplying the matrix with a ones vector but this doesn't seem so efficient.

Is there any better way? Couldn't find anything in cuBLAS.


Solution

  • Actually, multiplying the matrix by a ones vector using cublas_gemv() is a very efficient approach, unless you are considering writing your own kernel by hand.

    You can easily profile the memory bandwidth achieved by cublas_gemv(). It is very close to that of simply reading the whole matrix data once, which can be seen as the theoretical peak performance of matrix row/column summation.

    The extra multiply-by-1.0 won't cause much of a performance reduction, because:

    1. cublas_gemv() is basically a memory-bandwidth-bound operation, so extra arithmetic instructions won't be the bottleneck;
    2. the FMA (fused multiply-add) instruction folds the multiply into the add, so the extra multiply costs no additional instructions;
    3. the ones vector is usually much smaller than the matrix and is easily cached by the GPU, so it consumes little extra memory bandwidth.

    cublas_gemv() also helps you deal with the matrix layout problem: it works with row- or column-major storage and arbitrary padding (leading dimension).

    I also asked a similar question about this. My experiment shows that cublas_gemv() is better than a segmented reduction using thrust::reduce_by_key, which is another approach to matrix row summation.