Pycuda messing up numpy matrix transpose

Why does the transposed matrix look differently, when converted to a pycuda.gpuarray?

Can you reproduce this? What could cause this? Am I using the wrong approach?

Example code

from pycuda import gpuarray
import pycuda.autoinit
import numpy

data = numpy.random.randn(2,4).astype(numpy.float32)
data_gpu = gpuarray.to_gpu(data.T)
print "data\n",data
print "data_gpu.get()\n",data_gpu.get()
print "data.T\n",data.T

Output

data
[[ 0.70442784  0.08845157 -0.84840715 -1.81618035]
 [ 0.55292499  0.54911566  0.54672164  0.05098847]]
data_gpu.get()
[[ 0.70442784  0.08845157]
 [-0.84840715 -1.81618035]
 [ 0.55292499  0.54911566]
 [ 0.54672164  0.05098847]]
data.T
[[ 0.70442784  0.55292499]
 [ 0.08845157  0.54911566]
 [-0.84840715  0.54672164]
 [-1.81618035  0.05098847]]

Solution

The basic reason is that numpy transpose only creates a view, which has no effect on the underlying array storage, and it is that storage which PyCUDA directly accesses when a copy is performed to device memory. The solution is to use the copy method when doing the transpose, which will create an array with data in the transposed order in host memory, then copy that to the device:

data_gpu = gpuarray.to_gpu(data.T.copy())