Tags: machine-learning, neural-network, recurrent-neural-network, backpropagation, gradient-descent

What is the error term in backpropagation through time if I only have one output?


In this question, RNN: Back-propagation through time when output is taken only at final timestep, I've seen that if there is only one output at the final time step T, namely y(T), then the error terms at earlier time steps are not needed.

In that case, is the loss still E = sum(E(t)), or is it instead just E = E(T), where t runs from the first time step to the final time step T?


Solution

  • I'll try to clear up some confusion. This is the standard unrolled RNN:

    [figure: the standard unrolled RNN, with hidden states s[t] connected by recurrent edges and outputs o[t] on vertical edges]

    And suppose that o[t+1] is the output at the very last time step.

    When all the outputs are actually used by the network, the error is going to backpropagate through all vertical edges: to s[t+1] from o[t+1], to s[t] from o[t], ..., to s[0] from o[0]. In addition, all cells except for the last one receive the error from the subsequent cell: s[t] from s[t+1], s[t-1] from s[t], etc.

    It's easy to notice that all cells s[0] ... s[t] receive two error messages and they are added up (hence the sum).

    Now, the situation discussed by Denny Britz is that only o[t+1] is used by the network and all other outputs are ignored. This is equivalent to zero gradient flowing from o[t], o[t-1], ..., o[0]. Technically, the total gradient received at s[i] is still a sum, but it is a sum with a single non-zero element. Effectively, the error is backpropagated like this (a small numerical sketch at the end of this answer makes this concrete):

    o[t+1] -> s[t+1] -> s[t] -> s[t-1] -> ... -> s[0]
    

    A few other remarks:

    Then, is the loss still E = sum(E(t)), or is it instead just E = E(T)?

    I haven't touched upon the loss function in this example. The loss sits above the outputs: it compares them to the labels and initiates the first backward message. In both cases the loss term is the same; the only difference is whether that error message flows through all of the o[i] or only through o[t+1]. The short derivation below spells this out.
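
    Here is that argument in symbols (my own notation, not from the original post): write E(t) for the per-step loss and T for the final step, so that unused outputs simply contribute zero terms.

        % E(t): per-step loss, T: final step, s_i / o_i: hidden state and output at step i
        \[
          E = \sum_{t=0}^{T} E(t)
            = \underbrace{E(0) + \dots + E(T-1)}_{=\,0\ \text{(outputs unused)}} + E(T)
            = E(T)
        \]
        \[
          \frac{\partial E}{\partial s_i}
            = \underbrace{\frac{\partial E}{\partial o_i}\,\frac{\partial o_i}{\partial s_i}}_{=\,0\ \text{for } i<T}
            + \frac{\partial E}{\partial s_{i+1}}\,\frac{\partial s_{i+1}}{\partial s_i}
        \]

    So each cell formally still receives the sum of two messages, but only the recurrent term is non-zero when just the last output feeds the loss.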
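
    And here is a minimal numpy sketch of that backward flow, assuming a vanilla tanh RNN with weights U (input-to-hidden), W (hidden-to-hidden), and V (hidden-to-output); the names, shapes, and squared-error loss are illustrative assumptions of mine, not from the original post:

        import numpy as np

        rng = np.random.default_rng(0)
        T, n_in, n_hid, n_out = 5, 3, 4, 2        # sequence length and layer sizes (arbitrary)

        U = rng.normal(size=(n_hid, n_in))        # input  -> hidden
        W = rng.normal(size=(n_hid, n_hid))       # hidden -> hidden (recurrent edge)
        V = rng.normal(size=(n_out, n_hid))       # hidden -> output (vertical edge)

        x = rng.normal(size=(T, n_in))            # input sequence
        y = rng.normal(size=(n_out,))             # single target, compared to the LAST output only

        # forward pass: unroll the RNN
        s = np.zeros((T + 1, n_hid))              # s[0] is the initial state
        for t in range(T):
            s[t + 1] = np.tanh(U @ x[t] + W @ s[t])
        o_last = V @ s[T]                         # only this output enters the loss
        E = 0.5 * np.sum((o_last - y) ** 2)       # E = E(T); earlier E(t) are simply zero/absent

        # backward pass: the error enters once, at the last cell, then flows left
        d_o = o_last - y                          # dE/do[T] -- the first backward message
        dV = np.outer(d_o, s[T])
        d_s = V.T @ d_o                           # error reaching s[T] through the vertical edge

        dU = np.zeros_like(U)
        dW = np.zeros_like(W)
        for t in reversed(range(T)):
            # earlier cells get no vertical error (their outputs are unused),
            # only the recurrent error coming from the cell to their right
            d_pre = d_s * (1.0 - s[t + 1] ** 2)   # backprop through tanh
            dU += np.outer(d_pre, x[t])
            dW += np.outer(d_pre, s[t])
            d_s = W.T @ d_pre                     # pass the error on to s[t-1]

    The backward loop visits the cells exactly in the order o[t+1] -> s[t+1] -> s[t] -> ... -> s[0] from the answer; if the other outputs were also used, each iteration would simply add a V.T @ d_o[t] term to d_s before the tanh step.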