Tags: cuda, pycuda, cublas, scikits

How to convert an upper/lower gpuarray to the specific format required by cublasStbsv?


I am currently using pycuda and scikits.cuda to solve the linear system A*x = b, where A is a lower/upper triangular matrix. However, the cublasStbsv routine requires a specific banded storage format. For example, if a lower triangular matrix is A = [[1, 0, 0], [2, 3, 0], [4, 5, 6]], then the input required by cublasStbsv is [[1, 3, 6], [2, 5, 0], [4, 0, 0]], where the rows hold the diagonal, the first subdiagonal, and the second subdiagonal, respectively. With numpy this can easily be done with stride_tricks.as_strided, but I don't know how to do the same with a pycuda.gpuarray. Any help would be appreciated, thanks. I found pycuda.compyte.array.as_strided, but it cannot be applied to a gpuarray.
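
For reference, a minimal CPU-side numpy sketch of the layout cublasStbsv expects, using plain fancy indexing instead of as_strided; it reproduces the example above:

    import numpy as np

    A = np.asfortranarray([[1, 0, 0],
                           [2, 3, 0],
                           [4, 5, 6]], dtype=np.float32)
    dim = A.shape[0]
    rows, cols = np.tril_indices(dim)        # indices of the lower triangle
    band = np.zeros((dim, dim), dtype=A.dtype, order='F')
    band[rows - cols, cols] = A[rows, cols]  # subdiagonal k of A becomes band row k
    print(band)
    # [[1. 3. 6.]
    #  [2. 5. 0.]
    #  [4. 0. 0.]]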


Solution

  • I first got it done using theano: convert the gpuarray to a CudaNdarray, change the strides, and copy the result back to a gpuarray. Just be careful about the difference between Fortran and C order. Update: I finally got it done with gpuarray.multi_take_put (a usage sketch follows the function below):

    import numpy as np
    import pycuda.gpuarray as gpuarray

    def make_triangle(s_matrix, uplo='L'):
        """Convert a triangular matrix to the banded format required
        by cublasStbsv; the matrix must be in Fortran order.
        s_matrix: gpuarray
        """
        # make sure the dtype is float32
        if s_matrix.dtype != 'f':
            s_matrix = s_matrix.astype('f')
        dim = s_matrix.shape[0]
        if uplo == 'L':
            # lower triangle: element (i, j) goes to band row i - j, column j,
            # i.e. Fortran flat index i + j * (dim - 1)
            idx_tuple = np.tril_indices(dim)
            gidx = gpuarray.to_gpu(idx_tuple[0] + idx_tuple[1] * dim)
            gdst = gpuarray.to_gpu(idx_tuple[0] + idx_tuple[1] * (dim - 1))
            return gpuarray.multi_take_put([s_matrix], gdst, gidx, (dim, dim))[0]
        else:
            # upper triangle: element (i, j) goes to band row (dim - 1) + i - j,
            # column j, i.e. Fortran flat index i + (j + 1) * (dim - 1)
            idx_tuple = np.triu_indices(dim)
            gidx = gpuarray.to_gpu(idx_tuple[0] + idx_tuple[1] * dim)
            gdst = gpuarray.to_gpu(idx_tuple[0] + (idx_tuple[1] + 1) * (dim - 1))
            return gpuarray.multi_take_put([s_matrix], gdst, gidx, (dim, dim))[0]
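
    A minimal end-to-end sketch of feeding the result to the solver, assuming scikits.cuda exposes cublasStbsv with the CUBLAS C signature (handle, uplo, trans, diag, n, k, A, lda, x, incx) and that to_gpu preserves the Fortran layout of A:

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from scikits.cuda import cublas

    A = np.asfortranarray([[1, 0, 0],
                           [2, 3, 0],
                           [4, 5, 6]], dtype=np.float32)
    b = np.array([1, 8, 32], dtype=np.float32)

    g_band = make_triangle(gpuarray.to_gpu(A), uplo='L')
    g_x = gpuarray.to_gpu(b)          # tbsv overwrites the right-hand side with x
    dim = A.shape[0]

    handle = cublas.cublasCreate()
    # solve A * x = b with the full-bandwidth banded matrix (k = dim - 1, lda = dim)
    cublas.cublasStbsv(handle, 'L', 'N', 'N', dim, dim - 1,
                       g_band.gpudata, dim, g_x.gpudata, 1)
    cublas.cublasDestroy(handle)
    print(g_x.get())                  # expected: [1. 2. 3.]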