I am a beginner in deep learning, and I am currently struggling with the backpropagation algorithm. I found this piece of code online for backpropagation in a simple neural net with a sigmoid activation function.
import numpy as np

#Step 1: collect data
x = np.array([[0,0,1], [0,1,1], [1,0,1], [1,1,1]])
y = np.array([[0], [1], [1], [0]])

#Step 2: build model
num_epochs = 60000

#initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

for j in range(num_epochs):
    #feed forward through layers 0, 1, and 2
    k0 = x
    k1 = nonlin(np.dot(k0, syn0))
    k2 = nonlin(np.dot(k1, syn1))

    #how much did we miss the target value?
    k2_error = y - k2
    if (j % 10000) == 0:
        print("Error: " + str(np.mean(np.abs(k2_error))))

    #in what direction is the target value?
    k2_delta = k2_error * nonlin(k2, deriv=True)

    #how much did each k1 value contribute to the k2 error?
    k1_error = k2_delta.dot(syn1.T)
    k1_delta = k1_error * nonlin(k1, deriv=True)

    #update the weights
    syn1 += k1.T.dot(k2_delta)
    syn0 += k0.T.dot(k1_delta)
I do not get this line of code: k2_delta = k2_error*nonlin(k2, deriv=True). When calculating the local gradient, why does it multiply k2_error by the derivative of k2? Since the cost function in this algorithm is the absolute error, shouldn't we use something other than k2_error, such as the vector [-1, 1, 1, -1], as the local gradient of the cost function? I assume the analytic gradient is used here.
You can use k2_error as it's written. I tested your code (after making a formatting change) and confirmed that it minimizes the absolute error, which is different from k2_error (the ostensible but not actual target of gradient descent in your algorithm). You need k2_delta = k2_error*nonlin(k2, deriv=True) because the algorithm is minimizing the absolute error rather than k2_error. Here's how that works:

The relationship between k2_error and the input to k2

The derivative of k2_error = y - k2 with respect to k2 is -1. Using the chain rule, the derivative of k2_error with respect to the input of k2 is (-1)*nonlin(k2, deriv=True).
Thus:

The derivative of k2_error with respect to the input of k2 is always negative, because nonlin(k2, deriv=True) is always positive.

Gradient descent minimization of k2_error will therefore always want to push the input of k2 up (make it more positive) in order to make k2_error more negative.
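
If it helps, here's a quick numerical sanity check of that chain-rule claim (this is not part of your code; the target y = 1.0 and the pre-activation input z = 0.3 are arbitrary values I picked for illustration):

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

y = 1.0       # arbitrary target
z = 0.3       # arbitrary input to k2 (the pre-activation value)
eps = 1e-6

k2 = nonlin(z)
analytic = -1 * nonlin(k2, deriv=True)                                # chain-rule result
numeric = ((y - nonlin(z + eps)) - (y - nonlin(z - eps))) / (2*eps)   # finite difference
print(analytic, numeric)   # both are negative and approximately equal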

Minimize the absolute error

There are two practical possibilities for k2_error = y - k2, and each possibility implies a different strategy for minimizing the absolute error (our real goal). (There's an unlikely third possibility that we can ignore.)

Case 1: y < k2, which means k2_error < 0. To bring y and k2 closer together (minimize the absolute error), we need to make the error larger/more-positive. We know from the first section that we can do this by pushing the input of k2 down (k2_error increases when the input to k2 decreases).

Case 2: y > k2, which means k2_error > 0. To bring y and k2 closer together (minimize the absolute error), we need to make the error smaller/more-negative. We know from the first section that we can do this by pushing the input of k2 up (k2_error decreases when the input to k2 increases).

In summary, if k2_error is negative (Case 1), we minimize the absolute error by pushing the input of k2 down. If k2_error is positive (Case 2), we minimize the absolute error by pushing the input of k2 up.
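
To make the two cases concrete, here's a small finite-difference sketch; the targets and pre-activation inputs below are made-up values chosen so that one falls in each case. It nudges the input of k2 both ways and reports which nudge shrinks the absolute error:

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

eps = 0.01
for label, y_val, z in [("Case 1 (y < k2)", 0.0, 1.0), ("Case 2 (y > k2)", 1.0, -1.0)]:
    base = abs(y_val - nonlin(z))
    up   = abs(y_val - nonlin(z + eps))   # push the input of k2 up
    down = abs(y_val - nonlin(z - eps))   # push the input of k2 down
    print(label, "-> push the input", "up" if up < base else "down")
    # prints "down" for Case 1 and "up" for Case 2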

Explanation of k2_delta

We now know that gradient descent minimization of k2_error will always want to push the input of k2 up, but this only minimizes the absolute error when y > k2 (Case 2 from above). In Case 1, pushing the input of k2 up will increase the absolute error, so we modify the gradient at the input of k2, which is called k2_delta, by flipping its sign whenever we are facing Case 1. Case 1 implies that k2_error < 0, which means that we can flip the sign of the gradient simply by multiplying it by k2_error! Using this flip means that when we see Case 1, gradient descent wants to push the input of k2 down instead of up (we force gradient descent to abandon its default behavior).

To summarize, using k2_delta = k2_error*nonlin(k2, deriv=True) flips the sign of the usual gradient only when we're facing Case 1, which ensures we are always minimizing the absolute error (as opposed to minimizing k2_error).
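
And here's a small sketch of that sign flip in action, using the same kind of made-up scalar cases as above; the 0.5 step size is just an arbitrary stand-in for whatever effective step your weight update takes:

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

for label, y_val, z in [("Case 1 (y < k2)", 0.0, 2.0), ("Case 2 (y > k2)", 1.0, -2.0)]:
    k2 = nonlin(z)
    k2_error = y_val - k2
    k2_delta = k2_error * nonlin(k2, deriv=True)    # negative in Case 1, positive in Case 2
    new_abs_error = abs(y_val - nonlin(z + 0.5 * k2_delta))   # move the input along k2_delta
    print(label, np.sign(k2_delta), abs(k2_error), "->", new_abs_error)
    # the absolute error shrinks in both cases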

Important notes

Your algorithm modifies the weights by adding the negative gradient. Typically, gradient descent modifies the weights by subtracting the gradient. Adding the negative gradient is the same thing, but it does complicate my answer somewhat. For instance, the gradient at the input of k2 is actually k2_error*(-1)*nonlin(k2, deriv=True), not k2_error*nonlin(k2, deriv=True); the latter (your k2_delta) is the negative of that gradient.
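
To illustrate that equivalence, here's a sketch with placeholder values for k1 and syn1 (the shapes mirror your code, but the numbers are random): adding the negative gradient and subtracting the gradient give the same updated weights.

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

rng = np.random.RandomState(0)
k1 = rng.random_sample((4, 4))            # placeholder hidden-layer activations
syn1 = 2*rng.random_sample((4, 1)) - 1    # placeholder weights
y = np.array([[0.], [1.], [1.], [0.]])

k2 = nonlin(np.dot(k1, syn1))
k2_error = y - k2

gradient = k2_error * (-1) * nonlin(k2, deriv=True)   # the gradient described above
k2_delta = k2_error * nonlin(k2, deriv=True)          # its negative, as in your code

updated_add = syn1 + k1.T.dot(k2_delta)   # your code: add the negative gradient
updated_sub = syn1 - k1.T.dot(gradient)   # textbook form: subtract the gradient
print(np.allclose(updated_add, updated_sub))   # True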
You may be wondering why we use k2_error instead of sign(k2_error), and that's because we want to move the weights by smaller amounts as k2_error becomes smaller.
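
Here's a tiny illustration of that scaling, with made-up values (target y = 0.5 and two predictions): the k2_error-based update shrinks as the prediction approaches the target, while a sign(k2_error)-based update would not (here it even grows, since it only tracks the sigmoid derivative).

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

y = 0.5                      # arbitrary target
for k2 in (0.9, 0.55):       # far from, then close to, the target
    k2_error = y - k2
    print(k2_error * nonlin(k2, deriv=True),           # shrinks along with the error
          np.sign(k2_error) * nonlin(k2, deriv=True))  # sign-based alternative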