Tags: machine-learning, artificial-intelligence, backpropagation, loss-function, activation-function

Derivative of activation function vs. partial derivative w.r.t. loss function


Some terms in AI are confusing me. Is the derivative used in backpropagation the derivative of the activation function or the derivative of the loss function?

These terms confuse me: the derivative of the activation function, and the partial derivative with respect to the loss function?

I'm still not getting it right.


Solution

  • When you optimize a model, you define a loss function. This typically represents the error with respect to some training data.

    It is common to use gradient-based optimization to minimize this error, typically stochastic gradient descent (SGD) or a related approach (Adam, Adagrad, etc.).

    The gradient of the loss function is a vector composed of the partial derivatives of the loss with respect to each of the weights in the model.

    In each iteration, the weights are updated against the direction of the gradient (remember, we are minimizing).

    I guess the reason you might be confused is that, due to the chain rule, computing the gradient of the loss function requires differentiating the activation functions along the way. But keep in mind that this only happens because of the chain rule: what backpropagation ultimately computes is the derivative of the loss with respect to each weight. The sketch after this list illustrates the distinction on a single neuron.
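
Below is a minimal sketch (not from the original answer) of one SGD step for a single neuron with a sigmoid activation and a squared-error loss. The data values, learning rate, and variable names are hypothetical; the point is to show where the derivative of the activation function appears inside the partial derivative of the loss with respect to the weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # derivative of the activation function

# toy data and parameters (hypothetical values, for illustration only)
x, y_true = 0.5, 1.0
w, b, lr = 0.1, 0.0, 0.01

# forward pass
z = w * x + b                     # pre-activation
a = sigmoid(z)                    # activation (model output)
loss = (a - y_true) ** 2          # squared-error loss

# backward pass, via the chain rule:
#   dL/dw = dL/da * da/dz * dz/dw
dL_da = 2.0 * (a - y_true)        # derivative of the loss w.r.t. the output
da_dz = sigmoid_prime(z)          # derivative of the activation w.r.t. z
dz_dw = x                         # derivative of the pre-activation w.r.t. w
dL_dw = dL_da * da_dz * dz_dw     # partial derivative of the loss w.r.t. w

# SGD update: step against the gradient direction (we are minimizing)
w = w - lr * dL_dw
```

The quantity used to update the weight is `dL_dw`, the partial derivative of the loss; `da_dz`, the derivative of the activation, is only one factor inside it.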