Theano hard_sigmoid() breaks gradient descent

To illustrate the issue, let's follow this tutorial.

Theano has three ways to compute the sigmoid of a tensor: sigmoid, ultra_fast_sigmoid and hard_sigmoid. Using either of the latter two seems to break the gradient descent algorithm.

The conventional sigmoid converges as it should, but the other two behave strangely and inconsistently. ultra_fast_sigmoid throws an error as soon as the gradient is computed ('Method not defined ('grad', ultra_fast_sigmoid)'), whilst hard_sigmoid compiles fine but fails to converge on the solution.

Does anyone know the source of this behaviour? The documentation does not mention that this should happen, and it seems counterintuitive.


import theano
import theano.tensor as T
import theano.tensor.nnet as nnet
import numpy as np

x = T.dvector()
y = T.dscalar()

def layer(x, w):
    b = np.array([1], dtype=theano.config.floatX)
    new_x = T.concatenate([x, b])
    m = T.dot(w.T, new_x) #theta1: 3x3 * x: 3x1 = 3x1 ;;; theta2: 1x4 * 4x1

    h = nnet.sigmoid(m) ## THIS SIGMOID RIGHT HERE

    return h

def grad_desc(cost, theta):
    alpha = 0.1 #learning rate
    return theta - (alpha * T.grad(cost, wrt=theta))

theta1 = theano.shared(np.array(np.random.rand(3,3), dtype=theano.config.floatX))
theta2 = theano.shared(np.array(np.random.rand(4,1), dtype=theano.config.floatX))

hid1 = layer(x, theta1) #hidden layer

out1 = T.sum(layer(hid1, theta2)) #output layer
fc = (out1 - y)**2 #cost expression

cost = theano.function(inputs=[x, y], outputs=fc, updates=[
        (theta1, grad_desc(fc, theta1)),
        (theta2, grad_desc(fc, theta2))])
run_forward = theano.function(inputs=[x], outputs=out1)

inputs = np.array([[0,1],[1,0],[1,1],[0,0]]).reshape(4,2) #training data X
exp_y = np.array([1, 1, 0, 0]) #training data Y
cur_cost = 0
for i in range(2000):
    for k in range(len(inputs)):
        cur_cost = cost(inputs[k], exp_y[k]) #call our Theano-compiled cost function, it will auto update weights
    if i % 500 == 0: #only print the cost every 500 epochs/iterations (to save space)
        print('Cost: %s' % (cur_cost,))


To make the output shorter for this post, I changed the following lines relative to the tutorial (the results below were produced with these changes):

from theano.tensor.nnet import binary_crossentropy as cross_entropy #imports
fc = cross_entropy(out1, y) #cost expression
for i in range(4000): #training iteration


Output with sigmoid:
Cost: 1.62724279493
Cost: 0.545966632545
Cost: 0.156764560912
Cost: 0.0534911098234
Cost: 0.0280394147992
Cost: 0.0184933786794
Cost: 0.0136444190935
Cost: 0.0107482836159


Output with ultra_fast_sigmoid:
  File "", line 30, in <module>
    (theta1, grad_desc(fc, theta1)),
  File "", line 19, in grad_desc
    return theta - (alpha * T.grad(cost, wrt=theta))
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 545, in grad
    grad_dict, wrt, cost_name)
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1283, in _populate_grad_dict
    rval = [access_grad_cache(elem) for elem in wrt]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1241, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 951, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1241, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 951, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1241, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 951, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1241, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 951, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1241, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 951, in access_term_cache
    output_grads = [access_grad_cache(var) for var in node.outputs]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1241, in access_grad_cache
    term = access_term_cache(node)[idx]
  File "/usr/local/lib/python2.7/dist-packages/theano/", line 1089, in access_term_cache
    input_grads = node.op.grad(inputs, new_output_grads)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/", line 662, in grad
    rval = self._bgrad(inputs, ograds)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/", line 737, in _bgrad
    scalar_igrads = self.scalar_op.grad(scalar_inputs, scalar_ograds)
  File "/usr/local/lib/python2.7/dist-packages/theano/scalar/", line 878, in grad
theano.gof.utils.MethodNotDefined: ('grad', <class 'theano.tensor.nnet.sigm.UltraFastScalarSigmoid'>, 'UltraFastScalarSigmoid')


Output with hard_sigmoid:
Cost: 1.19810193303
Cost: 0.684360309062
Cost: 0.692614056124
Cost: 0.697902474354
Cost: 0.701540531661
Cost: 0.703807604483
Cost: 0.70470238116
Cost: 0.704385738831


  • Here's the source code of hard_sigmoid:

    def hard_sigmoid(x):
        """An approximation of sigmoid.
        More approximate and faster than ultra_fast_sigmoid.
        Approx in 3 parts: 0, scaled linear, 1
        Removing the slope and shift does not make it faster.
        """
        # Use the same dtype as determined by "upgrade_to_float",
        # and perform computation in that dtype.
        out_dtype = scalar.upgrade_to_float(scalar.Scalar(dtype=x.dtype))[0].dtype
        slope = tensor.constant(0.2, dtype=out_dtype)
        shift = tensor.constant(0.5, dtype=out_dtype)
        x = (x * slope) + shift
        x = tensor.clip(x, 0, 1)
        return x

    So it is implemented as a piecewise linear function whose gradient is 0.2 inside the range (-2.5, 2.5) and 0 elsewhere. If the input falls outside (-2.5, 2.5), the gradient is zero and no learning happens.

    So it may not be appropriate for training, but it can be used to approximate the prediction result.

    To evaluate the gradient of the network parameters, normally we use backpropagation.
    Here's a very simple example.

    import numpy
    import theano
    import theano.tensor.nnet

    x = theano.tensor.scalar()
    w = theano.shared(numpy.float32(1))
    y = theano.tensor.nnet.hard_sigmoid(w*x)  # y = hard_sigmoid(w*x), w is initialized to 1.
    dw = theano.grad(y, w)  # gradient wrt w: slope*x on the linear part, 0 on the flat parts
    net = theano.function([x], [y, dw])
    print(net(-3))
    print(net(-1))
    print(net(0))
    print(net(1))
    print(net(3))
    [array(0.0), array(-0.0)]  # zero gradient because the slope is zero
    [array(0.3), array(-0.2)]
    [array(0.5), array(0.0)]  # zero gradient because x is zero
    [array(0.7), array(0.2)]
    [array(1.0), array(0.0)]  # zero gradient because the slope is zero

    ultra_fast_sigmoid fails because, as its source code shows, it is hard-coded in Python as its own op rather than built from tensor expressions, so no grad method is defined for it.
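
To see how the zero slope of hard_sigmoid can stall training, here is a minimal pure-Python sketch (not Theano code). It assumes a single weight, a squared-error cost, and a hypothetical `train` helper. The starting value w = 3.0 is an assumption chosen so that the pre-activation begins outside (-2.5, 2.5). The question's network is larger, so this only illustrates the mechanism.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def hard_sigmoid(z):
    """Same formula as the Theano source quoted above: clip(0.2 * z + 0.5, 0, 1)."""
    return min(1.0, max(0.0, 0.2 * z + 0.5))


def hard_sigmoid_grad(z):
    """Slope is 0.2 on the linear segment (-2.5, 2.5) and 0 on both flat segments."""
    return 0.2 if -2.5 < z < 2.5 else 0.0


def train(act, act_grad, w=3.0, x=1.0, target=0.0, alpha=0.1, steps=2000):
    """Hypothetical helper: gradient descent on (act(w * x) - target) ** 2 for one weight."""
    for _ in range(steps):
        z = w * x
        # Chain rule: d cost / d w = 2 * (act(z) - target) * act'(z) * x
        w -= alpha * 2.0 * (act(z) - target) * act_grad(z) * x
    return w, (act(w * x) - target) ** 2


w_soft, cost_soft = train(sigmoid, sigmoid_grad)
w_hard, cost_hard = train(hard_sigmoid, hard_sigmoid_grad)

print('sigmoid:      w = %.3f, cost = %.4f' % (w_soft, cost_soft))
print('hard_sigmoid: w = %.3f, cost = %.4f' % (w_hard, cost_hard))
```

Under these assumptions the hard_sigmoid run leaves w at 3.0 and the cost at 1.0, because every update is multiplied by a zero slope, while the sigmoid run keeps lowering the cost.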