I am trying to train an autoencoder NN (3 layers - 2 visible, 1 hidden) using numpy and scipy for the MNIST digits images dataset. The implementation is based on the notation given here Below is my code:
def autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, data):
"""
The input theta is a 1-dimensional array because scipy.optimize.minimize expects
the parameters being optimized to be a 1d array.
First convert theta from a 1d array to the (W1, W2, b1, b2)
matrix/vector format, so that this follows the notation convention of the
lecture notes and tutorial.
You must compute the:
cost : scalar representing the overall cost J(theta)
grad : array representing the corresponding gradient of each element of theta
"""
training_size = data.shape[1]
# unroll theta to get (W1,W2,b1,b2) #
W1 = theta[0:hidden_size*visible_size]
W1 = W1.reshape(hidden_size,visible_size)
W2 = theta[hidden_size*visible_size:2*hidden_size*visible_size]
W2 = W2.reshape(visible_size,hidden_size)
b1 = theta[2*hidden_size*visible_size:2*hidden_size*visible_size + hidden_size]
b2 = theta[2*hidden_size*visible_size + hidden_size: 2*hidden_size*visible_size + hidden_size + visible_size]
#feedforward pass
a_l1 = data
z_l2 = W1.dot(a_l1) + numpy.tile(b1,(training_size,1)).T
a_l2 = sigmoid(z_l2)
z_l3 = W2.dot(a_l2) + numpy.tile(b2,(training_size,1)).T
a_l3 = sigmoid(z_l3)
#backprop
delta_l3 = numpy.multiply(-(data-a_l3),numpy.multiply(a_l3,1-a_l3))
delta_l2 = numpy.multiply(W2.T.dot(delta_l3),
numpy.multiply(a_l2, 1 - a_l2))
b2_derivative = numpy.sum(delta_l3,axis=1)/training_size
b1_derivative = numpy.sum(delta_l2,axis=1)/training_size
W2_derivative = numpy.dot(delta_l3,a_l2.T)/training_size + lambda_*W2
#print(W2_derivative.shape)
W1_derivative = numpy.dot(delta_l2,a_l1.T)/training_size + lambda_*W1
W1_derivative = W1_derivative.reshape(hidden_size*visible_size)
W2_derivative = W2_derivative.reshape(visible_size*hidden_size)
b1_derivative = b1_derivative.reshape(hidden_size)
b2_derivative = b2_derivative.reshape(visible_size)
grad = numpy.concatenate((W1_derivative,W2_derivative,b1_derivative,b2_derivative))
cost = 0.5*numpy.sum((data-a_l3)**2)/training_size + 0.5*lambda_*(numpy.sum(W1**2) + numpy.sum(W2**2))
return cost,grad
I have also implemented a function to estimate the numerical gradient and verify the correctness of my implementation (below).
def compute_gradient_numerical_estimate(J, theta, epsilon=0.0001):
"""
:param J: a loss (cost) function that computes the real-valued loss given parameters and data
:param theta: array of parameters
:param epsilon: amount to vary each parameter in order to estimate
the gradient by numerical difference
:return: array of numerical gradient estimate
"""
gradient = numpy.zeros(theta.shape)
eps_vector = numpy.zeros(theta.shape)
for i in range(0,theta.size):
eps_vector[i] = epsilon
cost1,grad1 = J(theta+eps_vector)
cost2,grad2 = J(theta-eps_vector)
gradient[i] = (cost1 - cost2)/(2*epsilon)
eps_vector[i] = 0
return gradient
The norm of the difference between the numerical estimate and the one computed by the function is around 6.87165125021e-09 which seems to be acceptable. My main problem seems to be to get the gradient descent algorithm "L-BGFGS-B" working using the scipy.optimize.minimize
function as below:
# theta is the 1-D array of(W1,W2,b1,b2)
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
options_ = {'maxiter': 4000, 'disp': False}
result = scipy.optimize.minimize(J, theta, method='L-BFGS-B', jac=True, options=options_)
I get the below output from this:
scipy.optimize.minimize() details:
fun: 90.802022224079778
hess_inv: <16474x16474 LbfgsInvHessProduct with dtype=float64>
jac: array([ -6.83667742e-06, -2.74886002e-06, -3.23531941e-06, ...,
1.22425735e-01, 1.23425062e-01, 1.28091250e-01])
message: b'ABNORMAL_TERMINATION_IN_LNSRCH'
nfev: 21
nit: 0
status: 2
success: False
x: array([-0.06836677, -0.0274886 , -0.03235319, ..., 0. ,
0. , 0. ])
Now, this post seems to indicate that the error could mean that the gradient function implementation could be wrong? But my numerical gradient estimate seems to confirm that my implementation is correct. I have tried varying the initial weights by using a uniform distribution as specified here but the problem still persists. Is there anything wrong with my backprop implementation?
Turns out the issue was a syntax error (very silly) with this line:
J = lambda x: utils.autoencoder_cost_and_grad(theta, visible_size, hidden_size, lambda_, patches_train)
I don't even have the lambda
parameter x
in the function declaration. So the theta
array wasn't even being passed whenever J
was being invoked.
This fixed it:
J = lambda x: utils.autoencoder_cost_and_grad(x, visible_size, hidden_size, lambda_, patches_train)