Looking at this code:
for (i <- (L - 2) to (0, -1)) {
  layerModels(i + 1).computePrevDelta(deltas(i + 1), outputs(i + 1), deltas(i))
}
I want to understand why we are passing outputs(i + 1) instead of outputs(i) in the code snippet above. As far as I understand, the output is only needed for a sigmoid activation layer, whose derivative is f'(x) = f(x) * (1 - f(x)) = outputs(i) * (1 - outputs(i)). That would mean that to find prevDelta we should be using outputs(i).
I figured out why this is so. I am answering here in case someone like me stumbles on the same question.
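Notice that we are calculating the delta for layer i, which depends only on the delta and gradient of the next layer, layer i + 1. Notice also that the call is made on layerModels(i + 1), as needed, and not on layerModels(i). Since it is layer i + 1 that computes its own gradient, the output it needs is its own output, outputs(i + 1).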
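
To make the indexing concrete, here is a minimal sketch. It assumes the sigmoid activation is modelled as its own layer, so that outputs(i) is the input x of that layer and outputs(i + 1) is f(x). The names LayerModel, SigmoidLayer and BackpropSketch are hypothetical stand-ins I made up for illustration, not the library's actual classes, and unlike the snippet in the question this version returns the previous delta instead of writing it in place.

```scala
// Hypothetical, simplified stand-in for a layer model (an assumption, not the real API).
trait LayerModel {
  // delta:  delta of this layer
  // output: output of this same layer
  // returns the delta of the previous layer
  def computePrevDelta(delta: Array[Double], output: Array[Double]): Array[Double]
}

// Element-wise sigmoid layer. Its derivative f'(x) = f(x) * (1 - f(x))
// is written in terms of the layer's own output f(x), not its input x.
object SigmoidLayer extends LayerModel {
  def eval(input: Array[Double]): Array[Double] =
    input.map(x => 1.0 / (1.0 + math.exp(-x)))

  def computePrevDelta(delta: Array[Double], output: Array[Double]): Array[Double] =
    delta.zip(output).map { case (d, o) => d * o * (1.0 - o) }
}

object BackpropSketch {
  def demo(): Array[Double] = {
    val outputsI = Array(0.0)                        // plays the role of outputs(i): the input x
    val outputsIPlus1 = SigmoidLayer.eval(outputsI)  // plays the role of outputs(i + 1): f(x) = 0.5
    val deltaIPlus1 = Array(1.0)                     // plays the role of deltas(i + 1)
    // Plays the role of deltas(i): 1.0 * 0.5 * (1 - 0.5) = 0.25
    SigmoidLayer.computePrevDelta(deltaIPlus1, outputsIPlus1)
  }
}
```

Under that assumption, the f(x) in the sigmoid derivative is outputs(i + 1), while outputs(i) is only the x that was fed into the layer, which is why passing outputs(i) would give the wrong gradient.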