Tags: cuda, cublas

cublas: same input and output matrix for better performance?


I see that CUBLAS can be an efficient package for a single large matrix multiplication or addition, etc. But in a common setting, most computations are dependent, so each step relies on the result of the previous one.

This causes a problem: because the output matrix has to be different from the input matrices in a CUBLAS routine (the input matrices are const), much time is spent allocating space and copying data from device to device for these temporary matrices.

So is it possible to do something like multiply(A, A, B), where the first argument is the output matrix and the second/third are the input matrices, to avoid the extra memory manipulation time? Or is there a better workaround?

Thanks a lot!


Solution

  • No, it is not possible to perform in-place operations like gemm using CUBLAS (in fact, I am not aware of any parallel BLAS implementation which guarantees such an operation will work).
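
    For reference, the gemm entry point in the cuBLAS v2 API (cublas_v2.h) declares both input matrices const and takes a separate output pointer; nothing in the interface or documentation sanctions aliasing the output with an input:

    // cublas_v2.h (abridged): inputs A and B are const, output C is separate
    cublasStatus_t cublasSgemm(cublasHandle_t handle,
                               cublasOperation_t transa, cublasOperation_t transb,
                               int m, int n, int k,
                               const float *alpha,
                               const float *A, int lda,
                               const float *B, int ldb,
                               const float *beta,
                               float *C, int ldc);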

    Having said that, this comment:

    .... much time is spent allocating space and copying data from device to device for these temporary matrices.

    makes me think you might be overlooking the obvious. While it is necessary to allocate space for interim matrices, it certainly isn't necessary to perform device-to-device memory copies when using such allocations. This:

    // If A, B & C are pointers to allocations in device memory
    // compute C = A*B and copy result to A
    multiply(C, A, B);
    cudaMemcpy(A, C, sizeA, cudaMemcpyDeviceToDevice);
    // now A = A*B
    

    can be replaced by

    multiply(C, A, B);
    float * tmp = A; A = C; C = tmp;
    

    i.e. you only need to exchange pointers on the host to perform the equivalent of a device-to-device memory copy, with no GPU time cost. This can't be used in every situation (for example, there are some in-place block operations which might still require an explicit memory transfer), but in most cases an explicit device-to-device memory transfer can be avoided. A fuller sketch of the pattern follows below.
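
    Putting the pieces together, here is a minimal sketch of that "ping-pong" pattern against the real cuBLAS v2 gemm API. The assumptions (not in the original post) are square n x n single-precision matrices in column-major order and a hypothetical helper name repeated_multiply; error checking is omitted for brevity.

    #include <cublas_v2.h>

    // Hypothetical sketch: compute A = A * B, 'steps' times, alternating
    // between two distinct n*n device buffers instead of copying.
    // Returns whichever buffer holds the final product.
    float *repeated_multiply(cublasHandle_t handle,
                             float *A, const float *B, float *C,
                             int n, int steps)
    {
        const float alpha = 1.0f, beta = 0.0f;
        for (int i = 0; i < steps; ++i) {
            // C = A * B, out of place, as cuBLAS requires
            cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                        n, n, n, &alpha, A, n, B, n, &beta, C, n);
            // Exchange the host-side pointers; no device memory moves.
            float *tmp = A; A = C; C = tmp;
        }
        return A;  // the buffer currently holding the latest product
    }

    Because the swaps happen purely on the host, the caller must take the result from the returned pointer; depending on the parity of steps it may be either of the two original allocations.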

    If the memory cost of large dense operations with CUBLAS is limiting your application, consider investigating "out of core" approaches to working with large dense matrices.