numpy scipy neural-network backpropagation

Implementing backpropagation gradient descent using scipy.optimize.minimize

I am trying to train an autoencoder NN (3 layers - 2 visible, 1 hidden) using numpy and scipy for the MNIST digits images dataset. The implementation is based on the notation given here Below is my code:

def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
"""
The input theta is a 1-dimensional array because scipy.optimize.minimize expects
the parameters being optimized to be a 1d array.
First convert theta from a 1d array to the (W1, W2, b1, b2)
matrix/vector format, so that this follows the notation convention of the
lecture notes and tutorial.
You must compute the:
    cost : scalar representing the overall cost J(theta)
    grad : array representing the corresponding gradient of each element of theta
"""

  training_size = data.shape[1]
  # unroll theta to get (W1,W2,b1,b2) #
  W1 = theta[0:hidden_size*visible_size]
  W1 = W1.reshape(hidden_size,visible_size)

  W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
  W2 = W2.reshape(visible_size,hidden_size)

  b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
  b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]

  #feedforward pass
  a_l1 = data

  z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
  a_l2 = sigmoid(z_l2)

  z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
  a_l3 = sigmoid(z_l3)

  #backprop
  delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
  delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
                             numpy.multiply(a_l2, 1 - a_l2))

  b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
  b1_derivative = numpy.sum(delta_l2,axis=1)/training_size

  W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
  #print(W2_derivative.shape)
  W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1

  W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
  W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
  b1_derivative = b1_derivative.reshape(hidden_size)
  b2_derivative = b2_derivative.reshape(visible_size)


  grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
  cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
  return cost,grad

I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).

def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
"""
:param J: a loss (cost) function that computes the real-valued loss given parameters and data
:param theta: array of parameters
:param epsilon: amount to vary each parameter in order to estimate
                the gradient by numerical difference
:return: array of numerical gradient estimate
"""

  gradient = numpy.zeros(theta.shape)

  eps_vector = numpy.zeros(theta.shape)
  for i in range(0,theta.size):

      eps_vector[i] = epsilon
      cost1,grad1 = J(theta+eps_vector)
      cost2,grad2 = J(theta-eps_vector)
      gradient[i] = (cost1 - cost2)/(2*epsilon)
      eps_vector[i] = 0


  return gradient

The norm of the difference between the numerical estimate and the one computed by the function is around 6.87165125021e-09 which seems to be acceptable. My main problem seems to be to get the gradient descent algorithm "L-BGFGS-B" working using the scipy.optimize.minimize function as below:

# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)

I get the below output from this:

scipy.optimize.minimize() details:
  fun: 90.802022224079778
 hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
  jac: array([ -6.83667742e-06,  -2.74886002e-06,  -3.23531941e-06, ...,
     1.22425735e-01,   1.23425062e-01,   1.28091250e-01])
message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
 nfev: 21
  nit: 0
 status: 2
success: False
    x: array([-0.06836677, -0.0274886 , -0.03235319, ...,  0.        ,
    0.        ,  0.        ])

Now, this post seems to indicate that the error could mean that the gradient function implementation could be wrong? But my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here but the problem still persists. Is there anything wrong with my backprop implementation?

Solution

Turns out the issue was a syntax error (very silly) with this line:

J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)

I don't even have the lambda parameter x in the function declaration. So the theta array wasn't even being passed whenever J was being invoked.

This fixed it:

J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)