I'm building a kernel that, among other things, uses the MAGMA function magma_dgeqrf2_gpu to perform a QR factorization. The routine writes the upper triangular factor R into the upper triangle of a general matrix d_A on the GPU.
Without transferring d_A back to the host (since I need it on the GPU for further operations), is there a library routine that extracts R from d_A as an upper triangular matrix on the device?
It's a bit silly, but the solution turned out to be simple: call magmablas_dlacopy() with the option that copies only the upper triangle, and use a second device matrix that has already been set to 0 as the destination.
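For reference, here is a small host-side sketch in plain C of what that zero-then-copy-upper step produces. It is not MAGMA code: copy_upper is a hypothetical helper written only to illustrate the result, and it assumes column-major storage with explicit leading dimensions, as MAGMA and LAPACK use. On the device, the equivalent would be something like magmablas_dlaset to zero the destination followed by magmablas_dlacopy with MagmaUpper; the exact argument lists differ between MAGMA versions, so check the headers of the one you have installed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side helper, not part of MAGMA.
 * Mirrors the effect of zeroing B and then copying only the upper
 * triangle of A into it. Both matrices are column-major; lda and ldb
 * are their leading dimensions. */
void copy_upper(int m, int n, const double *A, int lda, double *B, int ldb)
{
    assert(lda >= m && ldb >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* Keep entries on or above the diagonal (i <= j) and zero
             * everything below it, so B holds only R. */
            B[i + (size_t)j * ldb] = (i <= j) ? A[i + (size_t)j * lda] : 0.0;
        }
    }
}
```

Calling copy_upper(3, 3, A, 3, B, 3) on a 3x3 matrix leaves the diagonal and everything above it unchanged in B and sets the three entries below the diagonal to zero.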