python machine-learning regression theano theano-cuda

Theano simple linear regression runs on CPU instead of GPU

I created a simple python script (using Theano) performing linear regression which should be run on GPU. When code starts it says "using gpu device", but (according to the profiler) all operations are CPU-specific (ElemWise, instead of GpuElemWise, no GpuFromHost etc.).

I checked the variables, THEANO_FLAGS, everything seems right and I cannot see the catch (especially when Theano tutorials with the same settings are correctly run on GPU :)).

Here is the code:

# linear regression

import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data, "training-set")
E = theano.shared(output_data, "expected")
W1 = theano.shared(numpy.zeros((1, 2)))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)

for i in range(1000):
    train()

THEANO_FLAGS=cuda.root=/usr/local/cuda

device=gpu

floatX=float32

lib.cnmem=.5

profile=True

CUDA_LAUNCH_BLOCKING=1

Output:

Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
Function profiling
==================
  Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
  Time in 1000 calls to Function.__call__: 3.348637e-02s
  Time in Function.fn.__call__: 2.419019e-02s (72.239%)
  Time in thunks: 1.839781e-02s (54.941%)
  Total compile time: 1.350801e-01s
    Number of Apply nodes: 18
    Theano Optimizer time: 1.101730e-01s
       Theano validate time: 2.029657e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
       Import time 2.320528e-03s

Time in all call to theano.grad() 8.740902e-03s
Time since theano import 0.881s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000       2   theano.tensor.basic.Dot
  12.3%    83.9%       0.002s       3.22e-07s     C     7000       7   theano.tensor.elemwise.Elemwise
   5.7%    89.6%       0.001s       3.50e-07s     C     3000       3   theano.tensor.elemwise.DimShuffle
   4.0%    93.6%       0.001s       3.65e-07s     C     2000       2   theano.tensor.subtensor.Subtensor
   3.6%    97.2%       0.001s       3.31e-07s     C     2000       2   theano.compile.ops.Shape_i
   1.7%    98.9%       0.000s       3.06e-07s     C     1000       1   theano.tensor.opt.MakeVector
   1.1%   100.0%       0.000s       2.10e-07s     C     1000       1   theano.tensor.elemwise.Sum
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000        2   dot
   4.0%    75.6%       0.001s       3.65e-07s     C     2000        2   Subtensor{int64}
   3.5%    79.1%       0.001s       6.35e-07s     C     1000        1   InplaceDimShuffle{1,0}
   3.3%    82.4%       0.001s       6.06e-07s     C     1000        1   Elemwise{mul,no_inplace}
   2.4%    84.8%       0.000s       4.38e-07s     C     1000        1   Shape_i{0}
   2.3%    87.1%       0.000s       4.29e-07s     C     1000        1   Elemwise{Composite{((i0 * i1) / i2)}}
   2.3%    89.3%       0.000s       2.08e-07s     C     2000        2   InplaceDimShuffle{x,x}
   1.8%    91.1%       0.000s       3.25e-07s     C     1000        1   Elemwise{Cast{float64}}
   1.7%    92.8%       0.000s       3.06e-07s     C     1000        1   MakeVector{dtype='int64'}
   1.5%    94.3%       0.000s       2.78e-07s     C     1000        1   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
   1.4%    95.7%       0.000s       2.53e-07s     C     1000        1   Elemwise{Sub}[(0, 1)]
   1.2%    96.9%       0.000s       2.24e-07s     C     1000        1   Shape_i{1}
   1.1%    98.0%       0.000s       2.10e-07s     C     1000        1   Sum{acc_dtype=float64}
   1.1%    99.1%       0.000s       1.98e-07s     C     1000        1   Elemwise{Sqr}[(0, 0)]
   0.9%   100.0%       0.000s       1.66e-07s     C     1000        1   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  37.8%    37.8%       0.007s       6.95e-06s   1000     3   dot(<TensorType(float64, matrix)>, training-set.T)
  33.9%    71.7%       0.006s       6.24e-06s   1000    14   dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
   3.5%    75.1%       0.001s       6.35e-07s   1000     0   InplaceDimShuffle{1,0}(training-set)
   3.3%    78.4%       0.001s       6.06e-07s   1000    11   Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
   3.0%    81.4%       0.001s       5.58e-07s   1000     8   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
   2.4%    83.8%       0.000s       4.38e-07s   1000     2   Shape_i{0}(expected)
   2.3%    86.2%       0.000s       4.29e-07s   1000    12   Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
   1.8%    87.9%       0.000s       3.25e-07s   1000     6   Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
   1.7%    89.6%       0.000s       3.06e-07s   1000     4   MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
   1.6%    91.2%       0.000s       3.03e-07s   1000    10   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   1.5%    92.7%       0.000s       2.78e-07s   1000    16   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
   1.4%    94.1%       0.000s       2.53e-07s   1000     5   Elemwise{Sub}[(0, 1)](expected, dot.0)
   1.2%    95.3%       0.000s       2.24e-07s   1000     1   Shape_i{1}(expected)
   1.1%    96.5%       0.000s       2.10e-07s   1000    15   Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
   1.1%    97.6%       0.000s       1.98e-07s   1000    13   Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
   0.9%    98.5%       0.000s       1.72e-07s   1000     7   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
   0.9%    99.4%       0.000s       1.66e-07s   1000    17   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
   0.6%   100.0%       0.000s       1.13e-07s   1000     9   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Solution

As mentioned in the comments although you have set the allow_input_downcast parameter to True, but you need to make sure all the data to be assigned to shared variables are in float32. As of Jan. 06, 2016 Theano still cannot work with any other data type rather than float32 to do the computations on GPU, as mentioned here in a more details. So you have to have to cast your data into 'float32' format.

Therefore, here should be the code you need to use:

import numpy
import theano
import theano.tensor as T


input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data.astype('float32'), "training-set")
E = theano.shared(output_data.astype('float32'), "expected")
W1 = theano.shared(numpy.zeros((1, 2), dtype = 'float32'))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True, profile = True)

for i in range(1000):
    train()

train.profile.print_summary()

And here will be the profiling result:

Message: LearnTheano.py:18
  Time in 1000 calls to Function.__call__: 2.642968e-01s
  Time in Function.fn.__call__: 2.460811e-01s (93.108%)
  Time in thunks: 1.877530e-01s (71.039%)
  Total compile time: 2.483290e+01s
    Number of Apply nodes: 17
    Theano Optimizer time: 2.818849e-01s
       Theano validate time: 3.435850e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
       Import time 1.241469e-02s

Time in all call to theano.grad() 1.206994e-02s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  34.8%    34.8%       0.065s       3.27e-05s     C     2000       2   theano.sandbox.cuda.blas.GpuGemm
  28.8%    63.5%       0.054s       1.80e-05s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuElemwise
  12.9%    76.4%       0.024s       2.42e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
  10.3%    86.7%       0.019s       1.93e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuFromHost
   7.2%    93.9%       0.014s       1.36e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   1.8%    95.7%       0.003s       1.13e-06s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   1.5%    97.2%       0.003s       2.81e-06s     C     1000       1   theano.tensor.elemwise.Elemwise
   1.1%    98.4%       0.002s       1.08e-06s     C     2000       2   theano.compile.ops.Shape_i
   1.1%    99.5%       0.002s       1.02e-06s     C     2000       2   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.5%   100.0%       0.001s       9.96e-07s     C     1000       1   theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  25.3%    25.3%       0.047s       4.74e-05s     C     1000        1   GpuGemm{no_inplace}
  12.9%    38.1%       0.024s       2.42e-05s     C     1000        1   GpuCAReduce{pre=sqr,red=add}{1,1}
  12.8%    51.0%       0.024s       2.41e-05s     C     1000        1   GpuElemwise{mul,no_inplace}
  10.3%    61.3%       0.019s       1.93e-05s     C     1000        1   GpuFromHost
   9.5%    70.8%       0.018s       1.79e-05s     C     1000        1   GpuGemm{inplace}
   8.2%    79.0%       0.015s       1.55e-05s     C     1000        1   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   7.7%    86.7%       0.014s       1.44e-05s     C     1000        1   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
   7.2%    93.9%       0.014s       1.36e-05s     C     1000        1   HostFromGpu
   1.5%    95.4%       0.003s       2.81e-06s     C     1000        1   Elemwise{Cast{float32}}
   1.1%    96.5%       0.002s       1.02e-06s     C     2000        2   GpuSubtensor{int64}
   1.0%    97.5%       0.002s       9.00e-07s     C     2000        2   GpuDimShuffle{x,x}
   0.8%    98.3%       0.002s       1.59e-06s     C     1000        1   GpuDimShuffle{1,0}
   0.7%    99.1%       0.001s       1.38e-06s     C     1000        1   Shape_i{0}
   0.5%    99.6%       0.001s       9.96e-07s     C     1000        1   MakeVector
   0.4%   100.0%       0.001s       7.76e-07s     C     1000        1   Shape_i{1}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  25.3%    25.3%       0.047s       4.74e-05s   1000     3   GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
  12.9%    38.1%       0.024s       2.42e-05s   1000     5   GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
  12.8%    51.0%       0.024s       2.41e-05s   1000    13   GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
  10.3%    61.3%       0.019s       1.93e-05s   1000     7   GpuFromHost(Elemwise{Cast{float32}}.0)
   9.5%    70.8%       0.018s       1.79e-05s   1000    16   GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
   8.2%    79.0%       0.015s       1.55e-05s   1000    12   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
   7.7%    86.7%       0.014s       1.44e-05s   1000    15   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
   7.2%    93.9%       0.014s       1.36e-05s   1000    14   HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
   1.5%    95.4%       0.003s       2.81e-06s   1000     6   Elemwise{Cast{float32}}(MakeVector.0)
   0.8%    96.3%       0.002s       1.59e-06s   1000     0   GpuDimShuffle{1,0}(training-set)
   0.7%    97.0%       0.001s       1.38e-06s   1000     2   Shape_i{0}(expected)
   0.7%    97.7%       0.001s       1.30e-06s   1000     8   GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
   0.6%    98.3%       0.001s       1.08e-06s   1000    11   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   0.5%    98.8%       0.001s       9.96e-07s   1000     4   MakeVector(Shape_i{0}.0, Shape_i{1}.0)
   0.4%    99.2%       0.001s       7.76e-07s   1000     1   Shape_i{1}(expected)
   0.4%    99.6%       0.001s       7.40e-07s   1000     9   GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
   0.4%   100.0%       0.001s       7.25e-07s   1000    10   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)