I created a simple Python script (using Theano) that performs linear regression and should run on the GPU. When the code starts it says "Using gpu device", but according to the profiler all operations are CPU-specific (Elemwise instead of GpuElemwise, no GpuFromHost, etc.).
I checked the variables and THEANO_FLAGS, and everything seems right; I cannot see the catch (especially since the Theano tutorials run correctly on the GPU with the same settings :)).
Here is the code:
# linear regression
import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data, "training-set")
E = theano.shared(output_data, "expected")
W1 = theano.shared(numpy.zeros((1, 2)))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)

for i in range(1000):
    train()
Environment settings:
- THEANO_FLAGS: cuda.root=/usr/local/cuda, device=gpu, floatX=float32, lib.cnmem=.5, profile=True
- CUDA_LAUNCH_BLOCKING=1
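For reference, these settings would normally be passed as a single comma-separated THEANO_FLAGS string, with CUDA_LAUNCH_BLOCKING as a separate CUDA environment variable (the script name test2.py is taken from the profiler output below; assume your own paths):

```shell
# THEANO_FLAGS takes comma-separated key=value pairs;
# CUDA_LAUNCH_BLOCKING is handled by the CUDA runtime, not Theano.
THEANO_FLAGS='cuda.root=/usr/local/cuda,device=gpu,floatX=float32,lib.cnmem=.5,profile=True' \
CUDA_LAUNCH_BLOCKING=1 \
python test2.py
```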
Output:
Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
Function profiling
==================
Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
Time in 1000 calls to Function.__call__: 3.348637e-02s
Time in Function.fn.__call__: 2.419019e-02s (72.239%)
Time in thunks: 1.839781e-02s (54.941%)
Total compile time: 1.350801e-01s
Number of Apply nodes: 18
Theano Optimizer time: 1.101730e-01s
Theano validate time: 2.029657e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
Import time 2.320528e-03s
Time in all call to theano.grad() 8.740902e-03s
Time since theano import 0.881s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
71.7% 71.7% 0.013s 6.59e-06s Py 2000 2 theano.tensor.basic.Dot
12.3% 83.9% 0.002s 3.22e-07s C 7000 7 theano.tensor.elemwise.Elemwise
5.7% 89.6% 0.001s 3.50e-07s C 3000 3 theano.tensor.elemwise.DimShuffle
4.0% 93.6% 0.001s 3.65e-07s C 2000 2 theano.tensor.subtensor.Subtensor
3.6% 97.2% 0.001s 3.31e-07s C 2000 2 theano.compile.ops.Shape_i
1.7% 98.9% 0.000s 3.06e-07s C 1000 1 theano.tensor.opt.MakeVector
1.1% 100.0% 0.000s 2.10e-07s C 1000 1 theano.tensor.elemwise.Sum
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
71.7% 71.7% 0.013s 6.59e-06s Py 2000 2 dot
4.0% 75.6% 0.001s 3.65e-07s C 2000 2 Subtensor{int64}
3.5% 79.1% 0.001s 6.35e-07s C 1000 1 InplaceDimShuffle{1,0}
3.3% 82.4% 0.001s 6.06e-07s C 1000 1 Elemwise{mul,no_inplace}
2.4% 84.8% 0.000s 4.38e-07s C 1000 1 Shape_i{0}
2.3% 87.1% 0.000s 4.29e-07s C 1000 1 Elemwise{Composite{((i0 * i1) / i2)}}
2.3% 89.3% 0.000s 2.08e-07s C 2000 2 InplaceDimShuffle{x,x}
1.8% 91.1% 0.000s 3.25e-07s C 1000 1 Elemwise{Cast{float64}}
1.7% 92.8% 0.000s 3.06e-07s C 1000 1 MakeVector{dtype='int64'}
1.5% 94.3% 0.000s 2.78e-07s C 1000 1 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
1.4% 95.7% 0.000s 2.53e-07s C 1000 1 Elemwise{Sub}[(0, 1)]
1.2% 96.9% 0.000s 2.24e-07s C 1000 1 Shape_i{1}
1.1% 98.0% 0.000s 2.10e-07s C 1000 1 Sum{acc_dtype=float64}
1.1% 99.1% 0.000s 1.98e-07s C 1000 1 Elemwise{Sqr}[(0, 0)]
0.9% 100.0% 0.000s 1.66e-07s C 1000 1 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
37.8% 37.8% 0.007s 6.95e-06s 1000 3 dot(<TensorType(float64, matrix)>, training-set.T)
33.9% 71.7% 0.006s 6.24e-06s 1000 14 dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
3.5% 75.1% 0.001s 6.35e-07s 1000 0 InplaceDimShuffle{1,0}(training-set)
3.3% 78.4% 0.001s 6.06e-07s 1000 11 Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
3.0% 81.4% 0.001s 5.58e-07s 1000 8 Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
2.4% 83.8% 0.000s 4.38e-07s 1000 2 Shape_i{0}(expected)
2.3% 86.2% 0.000s 4.29e-07s 1000 12 Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
1.8% 87.9% 0.000s 3.25e-07s 1000 6 Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
1.7% 89.6% 0.000s 3.06e-07s 1000 4 MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
1.6% 91.2% 0.000s 3.03e-07s 1000 10 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
1.5% 92.7% 0.000s 2.78e-07s 1000 16 Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
1.4% 94.1% 0.000s 2.53e-07s 1000 5 Elemwise{Sub}[(0, 1)](expected, dot.0)
1.2% 95.3% 0.000s 2.24e-07s 1000 1 Shape_i{1}(expected)
1.1% 96.5% 0.000s 2.10e-07s 1000 15 Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
1.1% 97.6% 0.000s 1.98e-07s 1000 13 Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
0.9% 98.5% 0.000s 1.72e-07s 1000 7 Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
0.9% 99.4% 0.000s 1.66e-07s 1000 17 Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
0.6% 100.0% 0.000s 1.13e-07s 1000 9 InplaceDimShuffle{x,x}(Subtensor{int64}.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
As mentioned in the comments, even though you have set the allow_input_downcast parameter to True, you still need to make sure that all the data assigned to shared variables is in float32. As of Jan. 06, 2016, Theano still cannot do its computations on the GPU with any data type other than float32, as mentioned here in more detail. So you have to cast your data into 'float32' format.
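A quick NumPy-only sanity check (no Theano required) shows why the explicit cast is needed: the integer literals produce an integer array, and arithmetic with a Python float upcasts to float64, never down to float32:

```python
import numpy

# The integer literals give an integer dtype (int64 on most 64-bit platforms)...
data = numpy.matrix([[28, 1], [35, 2]])
print(data.dtype)

# ...and multiplying by a Python float upcasts to float64, not float32.
print((data * 0.0001).dtype)   # float64

# An explicit cast keeps the computation in float32, which the old
# CUDA backend requires.
data32 = data.astype('float32')
print((data32 * numpy.float32(0.0001)).dtype)  # float32
```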
Therefore, here is the code you need to use:
import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data.astype('float32'), "training-set")
E = theano.shared(output_data.astype('float32'), "expected")
W1 = theano.shared(numpy.zeros((1, 2), dtype='float32'))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile=True)

for i in range(1000):
    train()

train.profile.print_summary()
And here is the profiling result:
Message: LearnTheano.py:18
Time in 1000 calls to Function.__call__: 2.642968e-01s
Time in Function.fn.__call__: 2.460811e-01s (93.108%)
Time in thunks: 1.877530e-01s (71.039%)
Total compile time: 2.483290e+01s
Number of Apply nodes: 17
Theano Optimizer time: 2.818849e-01s
Theano validate time: 3.435850e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
Import time 1.241469e-02s
Time in all call to theano.grad() 1.206994e-02s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
34.8% 34.8% 0.065s 3.27e-05s C 2000 2 theano.sandbox.cuda.blas.GpuGemm
28.8% 63.5% 0.054s 1.80e-05s C 3000 3 theano.sandbox.cuda.basic_ops.GpuElemwise
12.9% 76.4% 0.024s 2.42e-05s C 1000 1 theano.sandbox.cuda.basic_ops.GpuCAReduce
10.3% 86.7% 0.019s 1.93e-05s C 1000 1 theano.sandbox.cuda.basic_ops.GpuFromHost
7.2% 93.9% 0.014s 1.36e-05s C 1000 1 theano.sandbox.cuda.basic_ops.HostFromGpu
1.8% 95.7% 0.003s 1.13e-06s C 3000 3 theano.sandbox.cuda.basic_ops.GpuDimShuffle
1.5% 97.2% 0.003s 2.81e-06s C 1000 1 theano.tensor.elemwise.Elemwise
1.1% 98.4% 0.002s 1.08e-06s C 2000 2 theano.compile.ops.Shape_i
1.1% 99.5% 0.002s 1.02e-06s C 2000 2 theano.sandbox.cuda.basic_ops.GpuSubtensor
0.5% 100.0% 0.001s 9.96e-07s C 1000 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
25.3% 25.3% 0.047s 4.74e-05s C 1000 1 GpuGemm{no_inplace}
12.9% 38.1% 0.024s 2.42e-05s C 1000 1 GpuCAReduce{pre=sqr,red=add}{1,1}
12.8% 51.0% 0.024s 2.41e-05s C 1000 1 GpuElemwise{mul,no_inplace}
10.3% 61.3% 0.019s 1.93e-05s C 1000 1 GpuFromHost
9.5% 70.8% 0.018s 1.79e-05s C 1000 1 GpuGemm{inplace}
8.2% 79.0% 0.015s 1.55e-05s C 1000 1 GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
7.7% 86.7% 0.014s 1.44e-05s C 1000 1 GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
7.2% 93.9% 0.014s 1.36e-05s C 1000 1 HostFromGpu
1.5% 95.4% 0.003s 2.81e-06s C 1000 1 Elemwise{Cast{float32}}
1.1% 96.5% 0.002s 1.02e-06s C 2000 2 GpuSubtensor{int64}
1.0% 97.5% 0.002s 9.00e-07s C 2000 2 GpuDimShuffle{x,x}
0.8% 98.3% 0.002s 1.59e-06s C 1000 1 GpuDimShuffle{1,0}
0.7% 99.1% 0.001s 1.38e-06s C 1000 1 Shape_i{0}
0.5% 99.6% 0.001s 9.96e-07s C 1000 1 MakeVector
0.4% 100.0% 0.001s 7.76e-07s C 1000 1 Shape_i{1}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
25.3% 25.3% 0.047s 4.74e-05s 1000 3 GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
12.9% 38.1% 0.024s 2.42e-05s 1000 5 GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
12.8% 51.0% 0.024s 2.41e-05s 1000 13 GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
10.3% 61.3% 0.019s 1.93e-05s 1000 7 GpuFromHost(Elemwise{Cast{float32}}.0)
9.5% 70.8% 0.018s 1.79e-05s 1000 16 GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
8.2% 79.0% 0.015s 1.55e-05s 1000 12 GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
7.7% 86.7% 0.014s 1.44e-05s 1000 15 GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
7.2% 93.9% 0.014s 1.36e-05s 1000 14 HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
1.5% 95.4% 0.003s 2.81e-06s 1000 6 Elemwise{Cast{float32}}(MakeVector.0)
0.8% 96.3% 0.002s 1.59e-06s 1000 0 GpuDimShuffle{1,0}(training-set)
0.7% 97.0% 0.001s 1.38e-06s 1000 2 Shape_i{0}(expected)
0.7% 97.7% 0.001s 1.30e-06s 1000 8 GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
0.6% 98.3% 0.001s 1.08e-06s 1000 11 GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
0.5% 98.8% 0.001s 9.96e-07s 1000 4 MakeVector(Shape_i{0}.0, Shape_i{1}.0)
0.4% 99.2% 0.001s 7.76e-07s 1000 1 Shape_i{1}(expected)
0.4% 99.6% 0.001s 7.40e-07s 1000 9 GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
0.4% 100.0% 0.001s 7.25e-07s 1000 10 GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)