I am a beginner in deep learning, and I am currently struggling with the backpropagation algorithm. I found this piece of code online for backpropagation in a simple neural net with a sigmoid activation function.
import numpy as np

#Step 1: collect data
x = np.array([[0,0,1], [0,1,1], [1,0,1], [1,1,1]])
y = np.array([[0], [1], [1], [0]])

#Step 2: build model
num_epochs = 60000

#initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

for j in range(num_epochs):
    #feed forward through layers 0, 1, and 2
    k0 = x
    k1 = nonlin(np.dot(k0, syn0))
    k2 = nonlin(np.dot(k1, syn1))

    #how much did we miss the target value?
    k2_error = y - k2
    if (j % 10000) == 0:
        print("Error: " + str(np.mean(np.abs(k2_error))))

    #in what direction is the target value?
    k2_delta = k2_error * nonlin(k2, deriv=True)

    #how much did each k1 value contribute to the k2 error?
    k1_error = k2_delta.dot(syn1.T)
    k1_delta = k1_error * nonlin(k1, deriv=True)

    #update the weights
    syn1 += k1.T.dot(k2_delta)
    syn0 += k0.T.dot(k1_delta)
I do not get this line of code: k2_delta = k2_error*nonlin(k2, deriv=True). When calculating the local gradient, why does it multiply k2_error by the derivative of k2? Since the cost function in this algorithm is the absolute error, shouldn't we use something other than k2_error, such as the vector [-1, 1, 1, -1], as the local gradient of the cost function? I assume the analytic gradient is used here.
You can use k2_error as it's written. I tested your code (after making a formatting change) and confirmed that it minimizes the absolute error, which is different from k2_error (the ostensible but not actual target of gradient descent in your algorithm). You need k2_delta = k2_error*nonlin(k2, deriv=True) because the algorithm is minimizing the absolute error rather than k2_error. Here's how that works:

The relationship between k2_error and the input to k2

The derivative of k2_error = y - k2 with respect to k2 is -1. Using the chain rule, the derivative of k2_error with respect to the input of k2 is (-1)*nonlin(k2, deriv=True).
Thus:

The derivative of k2_error with respect to the input of k2 is always negative, because nonlin(k2, deriv=True) is always positive.

Gradient descent minimization of k2_error will therefore always want to push the input of k2 up (make it more positive) in order to make k2_error more negative.
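
If it helps, here's a quick numerical sanity check of that chain-rule claim (this is not part of your code; the target y = 1.0 and the pre-activation input z = 0.3 are arbitrary values I picked for illustration):

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

y = 1.0       # arbitrary target
z = 0.3       # arbitrary input to k2 (the pre-activation value)
eps = 1e-6

k2 = nonlin(z)
analytic = -1 * nonlin(k2, deriv=True)                                # chain-rule result
numeric = ((y - nonlin(z + eps)) - (y - nonlin(z - eps))) / (2*eps)   # finite difference
print(analytic, numeric)   # both are negative and approximately equal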

Minimize the absolute error

There are two practical possibilities for k2_error = y - k2, and each possibility implies a different strategy for minimizing the absolute error (our real goal). (There's an unlikely third possibility that we can ignore.)

Case 1: y < k2, which means k2_error < 0. To bring y and k2 closer together (minimize the absolute error), we need to make the error larger/more-positive. We know from the first section that we can do this by pushing the input of k2 down (k2_error increases when the input to k2 decreases).

Case 2: y > k2, which means k2_error > 0. To bring y and k2 closer together (minimize the absolute error), we need to make the error smaller/more-negative. We know from the first section that we can do this by pushing the input of k2 up (k2_error decreases when the input to k2 increases).

In summary, if k2_error is negative (Case 1), we minimize the absolute error by pushing the input of k2 down. If k2_error is positive (Case 2), we minimize the absolute error by pushing the input of k2 up.
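
To make the two cases concrete, here's a small finite-difference sketch; the targets and pre-activation inputs below are made-up values chosen so that one falls in each case. It nudges the input of k2 both ways and reports which nudge shrinks the absolute error:

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

eps = 0.01
for label, y_val, z in [("Case 1 (y < k2)", 0.0, 1.0), ("Case 2 (y > k2)", 1.0, -1.0)]:
    base = abs(y_val - nonlin(z))
    up   = abs(y_val - nonlin(z + eps))   # push the input of k2 up
    down = abs(y_val - nonlin(z - eps))   # push the input of k2 down
    print(label, "-> push the input", "up" if up < base else "down")
    # prints "down" for Case 1 and "up" for Case 2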

Explanation of k2_delta

We now know that gradient descent minimization of k2_error will always want to push the input of k2 up, but this only minimizes the absolute error when y > k2 (Case 2 from above). In Case 1, pushing the input of k2 up will increase the absolute error, so we modify the gradient at the input of k2, which is called k2_delta, by flipping its sign whenever we are facing Case 1. Case 1 implies that k2_error < 0, which means that we can flip the sign of the gradient simply by multiplying it by k2_error! Using this flip means that when we see Case 1, gradient descent wants to push the input of k2 down instead of up (we force gradient descent to abandon its default behavior).

To summarize, using k2_delta = k2_error*nonlin(k2, deriv=True) flips the sign of the usual gradient only when we're facing Case 1, which ensures we are always minimizing the absolute error (as opposed to minimizing k2_error).
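
And here's a small sketch of that sign flip in action, using the same kind of made-up scalar cases as above; the 0.5 step size is just an arbitrary stand-in for whatever effective step your weight update takes:

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

for label, y_val, z in [("Case 1 (y < k2)", 0.0, 2.0), ("Case 2 (y > k2)", 1.0, -2.0)]:
    k2 = nonlin(z)
    k2_error = y_val - k2
    k2_delta = k2_error * nonlin(k2, deriv=True)    # negative in Case 1, positive in Case 2
    new_abs_error = abs(y_val - nonlin(z + 0.5 * k2_delta))   # move the input along k2_delta
    print(label, np.sign(k2_delta), abs(k2_error), "->", new_abs_error)
    # the absolute error shrinks in both cases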

Important notes

Your algorithm modifies the weights by adding the negative gradient. Typically, gradient descent modifies the weights by subtracting the gradient. Adding the negative gradient is the same thing, but it does complicate my answer somewhat. For instance, the gradient at the input of k2 is actually k2_error*(-1)*nonlin(k2, deriv=True), not k2_error*nonlin(k2, deriv=True); the latter (your k2_delta) is the negative of that gradient.
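
To illustrate that equivalence, here's a sketch with placeholder values for k1 and syn1 (the shapes mirror your code, but the numbers are random): adding the negative gradient and subtracting the gradient give the same updated weights.

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

rng = np.random.RandomState(0)
k1 = rng.random_sample((4, 4))            # placeholder hidden-layer activations
syn1 = 2*rng.random_sample((4, 1)) - 1    # placeholder weights
y = np.array([[0.], [1.], [1.], [0.]])

k2 = nonlin(np.dot(k1, syn1))
k2_error = y - k2

gradient = k2_error * (-1) * nonlin(k2, deriv=True)   # the gradient described above
k2_delta = k2_error * nonlin(k2, deriv=True)          # its negative, as in your code

updated_add = syn1 + k1.T.dot(k2_delta)   # your code: add the negative gradient
updated_sub = syn1 - k1.T.dot(gradient)   # textbook form: subtract the gradient
print(np.allclose(updated_add, updated_sub))   # True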
You may be wondering why we use k2_error instead of sign(k2_error), and that's because we want to move the weights by smaller amounts as k2_error becomes smaller.
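
Here's a tiny illustration of that scaling, with made-up values (target y = 0.5 and two predictions): the k2_error-based update shrinks as the prediction approaches the target, while a sign(k2_error)-based update would not (here it even grows, since it only tracks the sigmoid derivative).

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x*(1-x)
    return 1/(1+np.exp(-x))

y = 0.5                      # arbitrary target
for k2 in (0.9, 0.55):       # far from, then close to, the target
    k2_error = y - k2
    print(k2_error * nonlin(k2, deriv=True),           # shrinks along with the error
          np.sign(k2_error) * nonlin(k2, deriv=True))  # sign-based alternative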