Tags: neural-network, regression, backpropagation, derivative, softmax

How to implement the Softmax derivative independently from any loss function?


For a neural network library, I implemented several activation functions and loss functions along with their derivatives. They can be combined arbitrarily, and the derivative at the output layer is just the elementwise product of the loss derivative and the activation derivative.

However, I failed to implement the derivative of the Softmax activation function independently of any loss function. Because of the normalization, i.e. the denominator in the equation, changing a single input activation changes all output activations, not just one.
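
The coupling between inputs and outputs can be checked numerically. This is a minimal sketch, assuming a plain NumPy softmax; the helper name `softmax` and the example values are introduced here for illustration and are not part of the library in question.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

z = np.array([1.0, 2.0, 3.0])
bumped = z.copy()
bumped[0] += 0.1  # perturb only the first input

# Every output moves: the first one grows and the others shrink,
# and the changes cancel because the outputs always sum to one.
change = softmax(bumped) - softmax(z)
print(change)
```

This is why a single elementwise derivative vector cannot describe Softmax: each output has a partial derivative with respect to each input.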

Here is my Softmax implementation, whose derivative fails gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?

import numpy as np


class Softmax:

    def compute(self, incoming):
        exps = np.exp(incoming)
        return exps / exps.sum()

    def delta(self, incoming, outgoing):
        exps = np.exp(incoming)
        others = exps.sum() - exps
        return 1 / (2 + exps / others + others / exps)


activation = Softmax()
cost = SquaredError()

outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming, outgoing) * cost.delta(outgoing)
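
One way to see what goes wrong: algebraically, `1 / (2 + exps / others + others / exps)` equals `s * (1 - s)` for softmax output `s`, which is only the diagonal of the Softmax Jacobian. The sketch below checks this numerically; `softmax` and `question_delta` are hypothetical standalone helpers that mirror the class above, and the input values are arbitrary.

```python
import numpy as np

def softmax(z):
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

def question_delta(z):
    # The elementwise formula from Softmax.delta above.
    exps = np.exp(z)
    others = exps.sum() - exps
    return 1 / (2 + exps / others + others / exps)

z = np.array([0.5, -1.0, 2.0])
s = softmax(z)

# Full Jacobian: s_i * (1 - s_i) on the diagonal, -s_i * s_j elsewhere.
jacobian = np.diag(s) - np.outer(s, s)

# The elementwise formula reproduces only the diagonal entries.
print(np.allclose(question_delta(z), np.diag(jacobian)))

# The off-diagonal entries are nonzero and are dropped by that formula.
print(jacobian - np.diag(np.diag(jacobian)))
```

Multiplying that diagonal elementwise with the loss derivative therefore ignores the cross terms, which is consistent with a small but persistent gradient-check error.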

Solution

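  • Mathematically, the derivative of the Softmax output σ_j with respect to the logit z_i (for example, z_i = W_i·X) is

    $$\frac{\partial \sigma_j}{\partial z_i} = \sigma_j \, (\delta_{ij} - \sigma_i)$$

    where δ_ij is the Kronecker delta (1 if i = j, 0 otherwise). Because every output depends on every input, the derivative is an n × n Jacobian matrix rather than a vector.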
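
    For completeness, this result follows from the quotient rule. A short derivation, writing S for the sum of exponentials (a symbol introduced here for brevity):

    ```latex
    \sigma_j = \frac{e^{z_j}}{S}, \qquad S = \sum_k e^{z_k}

    \frac{\partial \sigma_j}{\partial z_i}
      = \frac{\delta_{ij}\, e^{z_j}\, S - e^{z_j}\, e^{z_i}}{S^2}
      = \delta_{ij}\, \sigma_j - \sigma_j \sigma_i
      = \sigma_j \, (\delta_{ij} - \sigma_i)
    ```

    For i = j this gives σ_i (1 − σ_i), and for i ≠ j it gives −σ_i σ_j.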

    If you implement this iteratively in Python:

    def softmax_grad(s):
        # s is the softmax output for the original input x, as a 1-D array
        # of shape (n,), e.g. s = np.array([0.3, 0.7]).

        # Start from an n x n matrix; every entry is overwritten below.
        jacobian_m = np.diag(s)

        for i in range(len(jacobian_m)):
            for j in range(len(jacobian_m)):
                if i == j:
                    jacobian_m[i][j] = s[i] * (1 - s[i])
                else:
                    jacobian_m[i][j] = -s[i] * s[j]
        return jacobian_m
    

    Test:

    In [95]: x
    Out[95]: array([1, 2])
    
    In [96]: softmax(x)
    Out[96]: array([ 0.26894142,  0.73105858])
    
    In [97]: softmax_grad(softmax(x))
    Out[97]: 
    array([[ 0.19661193, -0.19661193],
           [-0.19661193,  0.19661193]])
    

    A vectorized version:

    def softmax_grad(s):
        # Reshape the softmax output to a column vector of shape (n, 1)
        # so that np.dot(s, s.T) is the n x n outer product.
        s = s.reshape(-1, 1)
        return np.diagflat(s) - np.dot(s, s.T)

    soft_max = softmax(x)
    softmax_grad(soft_max)

    # array([[ 0.19661193, -0.19661193],
    #        [-0.19661193,  0.19661193]])
    

    Source: https://medium.com/intuitionmath/how-to-implement-the-softmax-derivative-independently-from-any-loss-function-ae6d44363a9d
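
To combine the Jacobian with an arbitrary loss, as the question asks, the output-layer delta is the Jacobian applied to the loss gradient instead of an elementwise product. The sketch below is one possible way to do this, not the library's actual API: `softmax_delta`, `squared_error`, and `squared_error_grad` are hypothetical names, with the squared error standing in for the question's `SquaredError`. It uses the identity J·g = s * (g − s·g), which avoids building the n × n matrix, and checks the result against central finite differences.

```python
import numpy as np

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    exps = np.exp(z - np.max(z))
    return exps / exps.sum()

def softmax_delta(s, loss_grad):
    # Same result as (np.diag(s) - np.outer(s, s)) @ loss_grad,
    # computed without materializing the Jacobian.
    return s * (loss_grad - np.dot(s, loss_grad))

def squared_error(outgoing, target):
    return 0.5 * np.sum((outgoing - target) ** 2)

def squared_error_grad(outgoing, target):
    return outgoing - target

z = np.array([0.5, -1.0, 2.0])
target = np.array([0.0, 0.0, 1.0])

s = softmax(z)
delta = softmax_delta(s, squared_error_grad(s, target))

# Gradient check: central finite differences of the loss w.r.t. each input.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (squared_error(softmax(z + step), target)
                  - squared_error(softmax(z - step), target)) / (2 * eps)

print(np.allclose(delta, numeric, atol=1e-8))
```

Under this design, the activation's `delta` method would take the loss gradient as an argument and return the combined delta, because Softmax cannot expose its derivative as a single vector to be multiplied elementwise. Any loss that can supply its gradient with respect to the outputs can be plugged in the same way.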