I'm using CUDA with cuBLAS to perform matrix operations.

I need to sum the rows (or columns) of a matrix. Currently I'm doing it by multiplying the matrix with a ones vector, but this doesn't seem very efficient. Is there a better way? I couldn't find anything suitable in cuBLAS.
Actually, multiplying the matrix by a ones vector with `cublas<t>gemv()` is a very efficient way, unless you are considering writing your own kernel by hand.

You can easily profile the memory bandwidth of `cublas<t>gemv()`. It's very close to that of simply reading the whole matrix data once, which can be seen as the theoretical peak performance of matrix row/column summation.
The extra multiplication by 1.0 won't lead to much performance reduction, because:

1. `cublas<t>gemv()` is basically a memory-bandwidth-bound operation, so the extra arithmetic instructions won't be the bottleneck;
2. `cublas<t>gemv()` also helps you deal with the matrix layout problem: it works with row/column-major storage and arbitrary padding (leading dimension).
I also asked a similar question about this. My experiment shows that `cublas<t>gemv()` is better than a segmented reduction using `thrust::reduce_by_key`, which is another approach to matrix row summation.