
Neural Network Backpropagation?


Can anyone recommend a website or give me a brief of how backpropagation is implemented in a NN? I understand the basic concept, but I'm unsure of how to go about writing the code.

Many of the sources I've found simply show equations without explaining why, and the variable names make them difficult to follow.

Example:

#define ABS(x) (((x) >= 0.0) ? (x) : -(x))

/* Fills delta[1..nj] for the output layer and returns the summed
   absolute error through *err.  Arrays are indexed from 1. */
void bpnn_output_error(double *delta, double *target, double *output,
                       int nj, double *err)
{
  int j;
  double o, t, errsum;

  errsum = 0.0;
  for (j = 1; j <= nj; j++) {
    o = output[j];   /* actual output of node j */
    t = target[j];   /* target value for node j */
    delta[j] = o * (1.0 - o) * (t - o);
    errsum += ABS(delta[j]);
  }
  *err = errsum;
}

In that example, can someone explain the purpose of

delta[j] = o * (1.0 - o) * (t - o);

Thanks.


Solution

  • The purpose of

    delta[j] = o * (1.0 - o) * (t - o);

    is to find the error of an output node in a backpropagation network.

    o represents the output of the node, and t is the expected output value for the node.

    The term o * (1.0 - o) is the first derivative of the sigmoid, the transfer function commonly used here. (Other transfer functions are not uncommon; using one would require rewriting this line to use that function's first derivative instead of the sigmoid's. A mismatch between the transfer function and the derivative would likely mean that training would not converge.) The node has an "activation" value that is fed through the transfer function to obtain the output o, like

    o = f(activation)
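
    For concreteness, here is a minimal sketch of the sigmoid and of its derivative written in terms of the output, which is where the o * (1.0 - o) factor comes from (the function names are my own, not taken from the code above):

    #include <math.h>

    /* Sigmoid transfer function: squashes an activation into (0, 1). */
    double sigmoid(double activation)
    {
      return 1.0 / (1.0 + exp(-activation));
    }

    /* Its first derivative, expressed in terms of the output o = sigmoid(a):
       d/da sigmoid(a) = sigmoid(a) * (1 - sigmoid(a)) = o * (1 - o). */
    double sigmoid_derivative_from_output(double o)
    {
      return o * (1.0 - o);
    }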

    The main thing is that backpropagation uses gradient descent, and the error gets propagated backwards by applying the Chain Rule. The problem is one of credit assignment, or blame if you will, for the hidden nodes, whose outputs are not directly comparable to an expected value. We start with what is known and comparable: the output nodes. There, the error is taken to be the first derivative of the transfer function at the node's activation, times the raw error between the expected output and the actual output.

    So more symbolically, we'd write that line as

    delta[j] = f'(activation_j) * (t_j - o_j)

    where f is your transfer function, and f' is the first derivative of it.
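
    As a made-up numeric check with the sigmoid: if t_j = 1.0 and o_j = 0.8, then delta[j] = 0.8 * (1.0 - 0.8) * (1.0 - 0.8) = 0.032, a small positive correction that will push this node's weights toward producing a larger output.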

    Further back, in the hidden layers, the error at a node is its estimated contribution to the errors found at the next layer. So the deltas from the succeeding layer are multiplied by the connecting weights, and those products are summed. That sum is multiplied by the first derivative of the transfer function at the hidden node's activation to get the delta for the hidden node, or

    delta[j] = f'(activation_j) * Sum(delta[k] * w_jk)

    where j now references a hidden node and k a node in a succeeding layer.
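
    To make that concrete, here is a minimal sketch of a hidden-layer counterpart to the function in the question, using the same 1-based indexing and the sigmoid derivative; the name, parameter list, and weight layout are illustrative assumptions, not taken from the original code:

    #include <math.h>

    /* delta_h : deltas to fill in for the nh hidden nodes (indexed 1..nh)
       delta_o : deltas already computed for the no next-layer nodes (1..no)
       who     : who[j][k] is the weight from hidden node j to next-layer node k
       hidden  : outputs of the hidden nodes
       err     : receives the summed absolute error for the layer              */
    void hidden_layer_error(double *delta_h, int nh, double *delta_o, int no,
                            double **who, double *hidden, double *err)
    {
      int j, k;
      double h, sum, errsum;

      errsum = 0.0;
      for (j = 1; j <= nh; j++) {
        h = hidden[j];
        sum = 0.0;
        /* Blame flowing back from the next layer: each succeeding delta
           weighted by the connecting weight, then summed. */
        for (k = 1; k <= no; k++)
          sum += delta_o[k] * who[j][k];
        /* Multiply by the sigmoid derivative at this node's output. */
        delta_h[j] = h * (1.0 - h) * sum;
        errsum += fabs(delta_h[j]);
      }
      *err = errsum;
    }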