Tags: python, machine-learning, regression, theano, theano-cuda

Theano simple linear regression runs on CPU instead of GPU


I wrote a simple Python script (using Theano) that performs linear regression and should run on the GPU. When the code starts it prints "Using gpu device", but according to the profiler all operations are CPU-specific (Elemwise instead of GpuElemwise, no GpuFromHost, etc.).

I checked the variables and THEANO_FLAGS, and everything seems right, so I cannot see the catch (especially since the Theano tutorials run correctly on the GPU with the same settings :)).

Here is the code:

# linear regression

import numpy
import theano
import theano.tensor as T

input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])

TS = theano.shared(input_data, "training-set")
E = theano.shared(output_data, "expected")
W1 = theano.shared(numpy.zeros((1, 2)))

O = T.dot(TS, W1.T)
cost = T.mean(T.sqr(E - O.T))
gradient = T.grad(cost=cost, wrt=W1)
update = [[W1, W1 - gradient * 0.0001]]
train = theano.function([], cost, updates=update, allow_input_downcast=True)

for i in range(1000):
    train()

Environment and Theano flags:

  • THEANO_FLAGS=cuda.root=/usr/local/cuda
  • device=gpu
  • floatX=float32
  • lib.cnmem=.5
  • profile=True
  • CUDA_LAUNCH_BLOCKING=1
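
For reference, here is a minimal sketch of how settings like these could be applied from inside a script rather than from the shell. It assumes that the first five items in the list are Theano flags (THEANO_FLAGS takes a comma-separated list of key=value pairs) and that CUDA_LAUNCH_BLOCKING is a separate environment variable. The variable names `flags` and `theano_flags` are only illustrative.

```python
import os

# Theano settings taken from the list above, kept in order.
flags = [
    ("cuda.root", "/usr/local/cuda"),
    ("device", "gpu"),
    ("floatX", "float32"),
    ("lib.cnmem", ".5"),
    ("profile", "True"),
]

# THEANO_FLAGS is a single comma-separated string of key=value pairs.
theano_flags = ",".join("{}={}".format(key, value) for key, value in flags)

# Both variables have to be set before `import theano`,
# because Theano reads its configuration at import time.
os.environ["THEANO_FLAGS"] = theano_flags
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # assumed to be a plain CUDA env variable

print(theano_flags)
```

Note that none of these settings change the dtype of data already stored in NumPy arrays; floatX only defines what `theano.config.floatX` refers to.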

Output:

Using gpu device 0: GeForce GT 650M (CNMeM is enabled)
Function profiling
==================
  Message: /home/mw/Documents/LiClipse Workspace/theano1/test2.py:18
  Time in 1000 calls to Function.__call__: 3.348637e-02s
  Time in Function.fn.__call__: 2.419019e-02s (72.239%)
  Time in thunks: 1.839781e-02s (54.941%)
  Total compile time: 1.350801e-01s
    Number of Apply nodes: 18
    Theano Optimizer time: 1.101730e-01s
       Theano validate time: 2.029657e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 1.491690e-02s
       Import time 2.320528e-03s

Time in all call to theano.grad() 8.740902e-03s
Time since theano import 0.881s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000       2   theano.tensor.basic.Dot
  12.3%    83.9%       0.002s       3.22e-07s     C     7000       7   theano.tensor.elemwise.Elemwise
   5.7%    89.6%       0.001s       3.50e-07s     C     3000       3   theano.tensor.elemwise.DimShuffle
   4.0%    93.6%       0.001s       3.65e-07s     C     2000       2   theano.tensor.subtensor.Subtensor
   3.6%    97.2%       0.001s       3.31e-07s     C     2000       2   theano.compile.ops.Shape_i
   1.7%    98.9%       0.000s       3.06e-07s     C     1000       1   theano.tensor.opt.MakeVector
   1.1%   100.0%       0.000s       2.10e-07s     C     1000       1   theano.tensor.elemwise.Sum
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  71.7%    71.7%       0.013s       6.59e-06s     Py    2000        2   dot
   4.0%    75.6%       0.001s       3.65e-07s     C     2000        2   Subtensor{int64}
   3.5%    79.1%       0.001s       6.35e-07s     C     1000        1   InplaceDimShuffle{1,0}
   3.3%    82.4%       0.001s       6.06e-07s     C     1000        1   Elemwise{mul,no_inplace}
   2.4%    84.8%       0.000s       4.38e-07s     C     1000        1   Shape_i{0}
   2.3%    87.1%       0.000s       4.29e-07s     C     1000        1   Elemwise{Composite{((i0 * i1) / i2)}}
   2.3%    89.3%       0.000s       2.08e-07s     C     2000        2   InplaceDimShuffle{x,x}
   1.8%    91.1%       0.000s       3.25e-07s     C     1000        1   Elemwise{Cast{float64}}
   1.7%    92.8%       0.000s       3.06e-07s     C     1000        1   MakeVector{dtype='int64'}
   1.5%    94.3%       0.000s       2.78e-07s     C     1000        1   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)]
   1.4%    95.7%       0.000s       2.53e-07s     C     1000        1   Elemwise{Sub}[(0, 1)]
   1.2%    96.9%       0.000s       2.24e-07s     C     1000        1   Shape_i{1}
   1.1%    98.0%       0.000s       2.10e-07s     C     1000        1   Sum{acc_dtype=float64}
   1.1%    99.1%       0.000s       1.98e-07s     C     1000        1   Elemwise{Sqr}[(0, 0)]
   0.9%   100.0%       0.000s       1.66e-07s     C     1000        1   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  37.8%    37.8%       0.007s       6.95e-06s   1000     3   dot(<TensorType(float64, matrix)>, training-set.T)
  33.9%    71.7%       0.006s       6.24e-06s   1000    14   dot(Elemwise{Composite{((i0 * i1) / i2)}}.0, training-set)
   3.5%    75.1%       0.001s       6.35e-07s   1000     0   InplaceDimShuffle{1,0}(training-set)
   3.3%    78.4%       0.001s       6.06e-07s   1000    11   Elemwise{mul,no_inplace}(InplaceDimShuffle{x,x}.0, InplaceDimShuffle{x,x}.0)
   3.0%    81.4%       0.001s       5.58e-07s   1000     8   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{1})
   2.4%    83.8%       0.000s       4.38e-07s   1000     2   Shape_i{0}(expected)
   2.3%    86.2%       0.000s       4.29e-07s   1000    12   Elemwise{Composite{((i0 * i1) / i2)}}(TensorConstant{(1, 1) of -2.0}, Elemwise{Sub}[(0, 1)].0, Elemwise{mul,no_inplace}.0)
   1.8%    87.9%       0.000s       3.25e-07s   1000     6   Elemwise{Cast{float64}}(MakeVector{dtype='int64'}.0)
   1.7%    89.6%       0.000s       3.06e-07s   1000     4   MakeVector{dtype='int64'}(Shape_i{0}.0, Shape_i{1}.0)
   1.6%    91.2%       0.000s       3.03e-07s   1000    10   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   1.5%    92.7%       0.000s       2.78e-07s   1000    16   Elemwise{Composite{(i0 - (i1 * i2))}}[(0, 0)](<TensorType(float64, matrix)>, TensorConstant{(1, 1) of ..974738e-05}, dot.0)
   1.4%    94.1%       0.000s       2.53e-07s   1000     5   Elemwise{Sub}[(0, 1)](expected, dot.0)
   1.2%    95.3%       0.000s       2.24e-07s   1000     1   Shape_i{1}(expected)
   1.1%    96.5%       0.000s       2.10e-07s   1000    15   Sum{acc_dtype=float64}(Elemwise{Sqr}[(0, 0)].0)
   1.1%    97.6%       0.000s       1.98e-07s   1000    13   Elemwise{Sqr}[(0, 0)](Elemwise{Sub}[(0, 1)].0)
   0.9%    98.5%       0.000s       1.72e-07s   1000     7   Subtensor{int64}(Elemwise{Cast{float64}}.0, Constant{0})
   0.9%    99.4%       0.000s       1.66e-07s   1000    17   Elemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](Sum{acc_dtype=float64}.0, Subtensor{int64}.0, Subtensor{int64}.0)
   0.6%   100.0%       0.000s       1.13e-07s   1000     9   InplaceDimShuffle{x,x}(Subtensor{int64}.0)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Solution

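  • As mentioned in the comments, setting the allow_input_downcast parameter to True is not enough: it only affects values passed as inputs when the function is called, not data stored in shared variables. You need to make sure that all data assigned to shared variables is float32. In the question, the shared variables are created from integer arrays and from numpy.zeros, which defaults to float64, which is why the profile shows TensorType(float64, matrix) and only CPU ops. As of Jan. 06, 2016, Theano still cannot do computations on the GPU with any data type other than float32, as mentioned here in more detail. So you have to cast your data to 'float32'.

    The code should therefore look like this: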

    import numpy
    import theano
    import theano.tensor as T
    
    
    input_data = numpy.matrix([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
    output_data = numpy.matrix([1600, 2100, 1400, 2500, 3200])
    
    # Cast everything stored in shared variables to float32 so it can be placed on the GPU
    TS = theano.shared(input_data.astype('float32'), "training-set")
    E = theano.shared(output_data.astype('float32'), "expected")
    W1 = theano.shared(numpy.zeros((1, 2), dtype='float32'))
    
    O = T.dot(TS, W1.T)
    cost = T.mean(T.sqr(E - O.T))
    gradient = T.grad(cost=cost, wrt=W1)
    update = [[W1, W1 - gradient * 0.0001]]
    train = theano.function([], cost, updates=update, allow_input_downcast=True, profile=True)
    
    for i in range(1000):
        train()
    
    train.profile.print_summary()
    

    This gives the following profiling result:

    Message: LearnTheano.py:18
      Time in 1000 calls to Function.__call__: 2.642968e-01s
      Time in Function.fn.__call__: 2.460811e-01s (93.108%)
      Time in thunks: 1.877530e-01s (71.039%)
      Total compile time: 2.483290e+01s
        Number of Apply nodes: 17
        Theano Optimizer time: 2.818849e-01s
           Theano validate time: 3.435850e-03s
        Theano Linker time (includes C, CUDA code generation/compiling): 2.453926e+01s
           Import time 1.241469e-02s
    
    Time in all call to theano.grad() 1.206994e-02s
    Class
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
      34.8%    34.8%       0.065s       3.27e-05s     C     2000       2   theano.sandbox.cuda.blas.GpuGemm
      28.8%    63.5%       0.054s       1.80e-05s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuElemwise
      12.9%    76.4%       0.024s       2.42e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuCAReduce
      10.3%    86.7%       0.019s       1.93e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.GpuFromHost
       7.2%    93.9%       0.014s       1.36e-05s     C     1000       1   theano.sandbox.cuda.basic_ops.HostFromGpu
       1.8%    95.7%       0.003s       1.13e-06s     C     3000       3   theano.sandbox.cuda.basic_ops.GpuDimShuffle
       1.5%    97.2%       0.003s       2.81e-06s     C     1000       1   theano.tensor.elemwise.Elemwise
       1.1%    98.4%       0.002s       1.08e-06s     C     2000       2   theano.compile.ops.Shape_i
       1.1%    99.5%       0.002s       1.02e-06s     C     2000       2   theano.sandbox.cuda.basic_ops.GpuSubtensor
       0.5%   100.0%       0.001s       9.96e-07s     C     1000       1   theano.tensor.opt.MakeVector
       ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)
    
    Ops
    ---
    <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
      25.3%    25.3%       0.047s       4.74e-05s     C     1000        1   GpuGemm{no_inplace}
      12.9%    38.1%       0.024s       2.42e-05s     C     1000        1   GpuCAReduce{pre=sqr,red=add}{1,1}
      12.8%    51.0%       0.024s       2.41e-05s     C     1000        1   GpuElemwise{mul,no_inplace}
      10.3%    61.3%       0.019s       1.93e-05s     C     1000        1   GpuFromHost
       9.5%    70.8%       0.018s       1.79e-05s     C     1000        1   GpuGemm{inplace}
       8.2%    79.0%       0.015s       1.55e-05s     C     1000        1   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)]
       7.7%    86.7%       0.014s       1.44e-05s     C     1000        1   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
       7.2%    93.9%       0.014s       1.36e-05s     C     1000        1   HostFromGpu
       1.5%    95.4%       0.003s       2.81e-06s     C     1000        1   Elemwise{Cast{float32}}
       1.1%    96.5%       0.002s       1.02e-06s     C     2000        2   GpuSubtensor{int64}
       1.0%    97.5%       0.002s       9.00e-07s     C     2000        2   GpuDimShuffle{x,x}
       0.8%    98.3%       0.002s       1.59e-06s     C     1000        1   GpuDimShuffle{1,0}
       0.7%    99.1%       0.001s       1.38e-06s     C     1000        1   Shape_i{0}
       0.5%    99.6%       0.001s       9.96e-07s     C     1000        1   MakeVector
       0.4%   100.0%       0.001s       7.76e-07s     C     1000        1   Shape_i{1}
       ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)
    
    Apply
    ------
    <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
      25.3%    25.3%       0.047s       4.74e-05s   1000     3   GpuGemm{no_inplace}(expected, TensorConstant{-1.0}, <CudaNdarrayType(float32, matrix)>, GpuDimShuffle{1,0}.0, TensorConstant{1.0})
      12.9%    38.1%       0.024s       2.42e-05s   1000     5   GpuCAReduce{pre=sqr,red=add}{1,1}(GpuGemm{no_inplace}.0)
      12.8%    51.0%       0.024s       2.41e-05s   1000    13   GpuElemwise{mul,no_inplace}(GpuDimShuffle{x,x}.0, GpuDimShuffle{x,x}.0)
      10.3%    61.3%       0.019s       1.93e-05s   1000     7   GpuFromHost(Elemwise{Cast{float32}}.0)
       9.5%    70.8%       0.018s       1.79e-05s   1000    16   GpuGemm{inplace}(<CudaNdarrayType(float32, matrix)>, TensorConstant{-9.99999974738e-05}, GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0, training-set, TensorConstant{1.0})
       8.2%    79.0%       0.015s       1.55e-05s   1000    12   GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)](GpuCAReduce{pre=sqr,red=add}{1,1}.0, GpuSubtensor{int64}.0, GpuSubtensor{int64}.0)
       7.7%    86.7%       0.014s       1.44e-05s   1000    15   GpuElemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](CudaNdarrayConstant{[[-2.]]}, GpuGemm{no_inplace}.0, GpuElemwise{mul,no_inplace}.0)
       7.2%    93.9%       0.014s       1.36e-05s   1000    14   HostFromGpu(GpuElemwise{Composite{((i0 / i1) / i2)}}[(0, 0)].0)
       1.5%    95.4%       0.003s       2.81e-06s   1000     6   Elemwise{Cast{float32}}(MakeVector.0)
       0.8%    96.3%       0.002s       1.59e-06s   1000     0   GpuDimShuffle{1,0}(training-set)
       0.7%    97.0%       0.001s       1.38e-06s   1000     2   Shape_i{0}(expected)
       0.7%    97.7%       0.001s       1.30e-06s   1000     8   GpuSubtensor{int64}(GpuFromHost.0, Constant{0})
       0.6%    98.3%       0.001s       1.08e-06s   1000    11   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
       0.5%    98.8%       0.001s       9.96e-07s   1000     4   MakeVector(Shape_i{0}.0, Shape_i{1}.0)
       0.4%    99.2%       0.001s       7.76e-07s   1000     1   Shape_i{1}(expected)
       0.4%    99.6%       0.001s       7.40e-07s   1000     9   GpuSubtensor{int64}(GpuFromHost.0, Constant{1})
       0.4%   100.0%       0.001s       7.25e-07s   1000    10   GpuDimShuffle{x,x}(GpuSubtensor{int64}.0)
       ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)
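
The dtype issue described in the answer can be checked with NumPy alone, before anything is handed to theano.shared. The following is a small sketch, not part of the original answer: it uses numpy.array in place of numpy.matrix (dtype inference is the same), and the names `input32` and `weights32` are only illustrative.

```python
import numpy

# Same values as in the question; integer literals give an integer array.
input_data = numpy.array([[28, 1], [35, 2], [18, 1], [56, 2], [80, 3]])
weights = numpy.zeros((1, 2))  # numpy.zeros defaults to float64

print(input_data.dtype)  # a platform-dependent integer dtype
print(weights.dtype)     # float64

# Explicit casts, as in the corrected script.
input32 = input_data.astype('float32')
weights32 = numpy.zeros((1, 2), dtype='float32')

print(input32.dtype)    # float32
print(weights32.dtype)  # float32
```

Printing the dtypes like this before creating the shared variables is a quick way to spot data that would keep the graph on the CPU.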