I have a subtensor operation and, for some reason, Theano cannot transfer it to the GPU.
Some sample code:
import numpy
import theano
import theano.printing
import theano.compile.io
import theano.compile.function_module
import theano.tensor as T
from theano.sandbox.cuda.basic_ops import as_cuda_ndarray_variable
n_copies, n_cells = 5, 10
P = T.constant(numpy.zeros((n_copies, n_cells), dtype="int32")) # (n_copies,n_cells) -> list of indices
meminkey = T.fmatrix() # (batch,n_cells)
meminkey = as_cuda_ndarray_variable(meminkey)
i_t = T.ones((meminkey.shape[0],))
batches = T.arange(0, i_t.shape[0]).dimshuffle(0, 'x', 'x') # (batch,n_copies,n_cells)
P_bc = P.dimshuffle('x', 0, 1) # (batch,n_copies,n_cells)
meminkeyP = meminkey[batches, P_bc] # (batch,n_copies,n_cells)
meminkeyP = as_cuda_ndarray_variable(meminkeyP)
func = theano.function(inputs=[meminkey], outputs=[meminkeyP])
theano.printing.debugprint(func)
I added some as_cuda_ndarray_variable calls to make the problem clearer: in the output you can see the GpuFromHost and HostFromGpu transfers, which would be avoided if the AdvancedSubtensor could be done on the GPU. Output:
Using gpu device 0: GeForce GTX TITAN (CNMeM is disabled, CuDNN not available)
GpuFromHost [id A] '' 5
|AdvancedSubtensor [id B] '' 4
|HostFromGpu [id C] '' 1
| |<CudaNdarrayType(float32, matrix)> [id D]
|InplaceDimShuffle{0,x,x} [id E] '' 3
| |ARange{dtype='int64'} [id F] '' 2
| |TensorConstant{0} [id G]
| |Shape_i{0} [id H] '' 0
| | |<CudaNdarrayType(float32, matrix)> [id D]
| |TensorConstant{1} [id I]
|TensorConstant{[[[4 0 1 2..5 8 9 7]]]} [id J]
So, why is Theano not able to transform this into a GPU op?
Also, how can I rewrite the code so that Theano will do the calculation on the GPU?
Ok, so in the Google Groups posts which I linked, it's explained pretty well why it doesn't work. AdvancedSubtensor is the most generic form, which works with all crazy types of indexing variants. Then there is AdvancedSubtensor1, which only works for a certain subset of cases. A GPU version exists only for AdvancedSubtensor1, not for AdvancedSubtensor. I didn't fully understand the reason, but there is an ongoing discussion about it.
AdvancedSubtensor1 can be used when there is a single list of indices. However, in my example, that is not the case. The common workaround you see, also in some other examples in those Google Groups posts, is to flatten the array first and calculate the indices for the flattened array.
Most examples work with some kind of nonzero() or similar, where you also flatten the base array in the same way and then get the indices for the flattened version.
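To make the distinction concrete, here is a minimal sketch of my own (not from the linked posts): a single integer index vector along the first axis gives AdvancedSubtensor1, for which a GPU op exists, while indexing with several broadcasted index tensors gives the generic AdvancedSubtensor.
import theano
import theano.printing
import theano.tensor as T
x = T.fmatrix("x")
idx = T.ivector("idx")
sub1 = x[idx]      # one index vector along axis 0 -> AdvancedSubtensor1 (has a GPU version)
sub = x[idx, idx]  # several index tensors -> generic AdvancedSubtensor (CPU only)
theano.printing.debugprint(theano.function([x, idx], sub1))
theano.printing.debugprint(theano.function([x, idx], sub))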
So, the question is, how to apply this to my code?
Actually, there is a simpler solution which will use AdvancedSubtensor1 and which I didn't realize initially:
meminkeyP = meminkey[:, P] # (batch,n_copies,n_cells)
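As a quick plain-numpy sanity check of my own (the array a and the index table p below are made up for illustration), a[:, p] performs the same gather as the explicit broadcasted-index version:
import numpy
a = numpy.arange(12, dtype="float32").reshape(3, 4)  # (batch,n_cells)
p = numpy.array([[0, 2, 1], [3, 1, 0]], dtype="int32")  # (n_copies,n_cells') index table
batches = numpy.arange(3)[:, None, None]  # (batch,1,1)
p_bc = p[None]  # (1,n_copies,n_cells')
assert (a[:, p] == a[batches, p_bc]).all()  # both have shape (batch,n_copies,n_cells')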
However, before I realized that, I came up with a generic solution which also works for other cases. I transform my indices tuple (batches, P_bc)
into indices for the flattened version. This is done with this function:
def indices_in_flatten_array(ndim, shape, *args):
  """
  We expect that all args can be broadcasted together.
  So, if we have some array A with ndim&shape as given,
  A[args] would give us a subtensor.
  We return the indices so that A[args].flatten()
  and A.flatten()[indices] are the same.
  """
  assert ndim > 0
  assert len(args) == ndim
  indices_per_axis = [args[i] for i in range(ndim)]
  # Scale each axis' indices by the product of the sizes of all later axes
  # (its row-major stride), then sum the contributions of all axes.
  for i in range(ndim):
    for j in range(i + 1, ndim):
      indices_per_axis[i] *= shape[j]
  indices = indices_per_axis[0]
  for i in range(1, ndim):
    indices += indices_per_axis[i]
  return indices
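For intuition, here is the same arithmetic in plain numpy for the 2D case (my own check, not part of the original code): in row-major order, the flat index is i0 * shape[1] + i1, which is exactly what the loops above accumulate axis by axis.
import numpy
a = numpy.arange(12).reshape(3, 4)
i0 = numpy.array([[0], [2]])  # broadcastable index tensors, as the docstring assumes
i1 = numpy.array([[1, 3]])
flat_idx = i0 * a.shape[1] + i1
assert (a.flatten()[flat_idx] == a[i0, i1]).all()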
Then, I use it like this:
meminkeyP = meminkey.flatten()[indices_in_flatten_array(meminkey.ndim, meminkey.shape, batches, P_bc)]
This seems to work.
And I get this output:
Using gpu device 0: GeForce GTX TITAN (CNMeM is disabled, CuDNN not available)
GpuReshape{3} [id A] '' 11
|GpuAdvancedSubtensor1 [id B] '' 10
| |GpuReshape{1} [id C] '' 2
| | |<CudaNdarrayType(float32, matrix)> [id D]
| | |TensorConstant{(1,) of -1} [id E]
| |Reshape{1} [id F] '' 9
| |Elemwise{second,no_inplace} [id G] '' 8
| | |TensorConstant{(1, 5, 10) of 0} [id H]
| | |Elemwise{Mul}[(0, 0)] [id I] '' 7
| | |InplaceDimShuffle{0,x,x} [id J] '' 6
| | | |ARange{dtype='int64'} [id K] '' 4
| | | |TensorConstant{0} [id L]
| | | |Shape_i{0} [id M] '' 0
| | | | |<CudaNdarrayType(float32, matrix)> [id D]
| | | |TensorConstant{1} [id N]
| | |InplaceDimShuffle{x,x,x} [id O] '' 5
| | |Shape_i{1} [id P] '' 1
| | |<CudaNdarrayType(float32, matrix)> [id D]
| |TensorConstant{(1,) of -1} [id E]
|MakeVector{dtype='int64'} [id Q] '' 3
|Shape_i{0} [id M] '' 0
|TensorConstant{5} [id R]
|TensorConstant{10} [id S]
Small test case:
def test_indices_in_flatten_array():
  n_copies, n_cells = 5, 4
  n_complex_cells = n_cells // 2  # integer division, so it also works under Python 3
  n_batch = 3
  static_rng = numpy.random.RandomState(1234)
  def make_permut():
    p = numpy.zeros((n_copies, n_cells), dtype="int32")
    for i in range(n_copies):
      p[i, :n_complex_cells] = static_rng.permutation(n_complex_cells)
      # Same permutation for imaginary part.
      p[i, n_complex_cells:] = p[i, :n_complex_cells] + n_complex_cells
    return T.constant(p)
  P = make_permut()  # (n_copies,n_cells) -> list of indices
  meminkey = T.as_tensor_variable(static_rng.rand(n_batch, n_cells).astype("float32"))
  i_t = T.ones((meminkey.shape[0],))  # (batch,)
  n_batch = i_t.shape[0]
  batches = T.arange(0, n_batch).dimshuffle(0, 'x', 'x')  # (batch,n_copies,n_cells)
  P_bc = P.dimshuffle('x', 0, 1)  # (batch,n_copies,n_cells)
  meminkeyP1 = meminkey[batches, P_bc]  # (batch,n_copies,n_cells)
  meminkeyP2 = meminkey.flatten()[indices_in_flatten_array(meminkey.ndim, meminkey.shape, batches, P_bc)]
  numpy.testing.assert_allclose(meminkeyP1.eval(), meminkeyP2.eval())