Tags: python, keras, deep-learning, conv-neural-network, activation

Using softmax in hidden layer and relu in output layer for CNN regression


I am creating a CNN to predict the distributed strain applied to an optical fiber from the measured light spectrum (2D), which is ideally a Lorentzian curve. The label is a 1D array where only the strained section is non-zero (the label looks like a square wave).

My CNN has 10 alternating convolution and pooling layers, all activated by ReLU. These are followed by 3 fully-connected hidden layers with softmax activation, and then an output layer activated by ReLU. Usually, CNNs and other neural networks use ReLU in the hidden layers and softmax in the output layer (for classification problems). In my case, however, I use softmax in the hidden layers to first determine the positions along the optical fiber where strain is applied (i.e. where the label is non-zero), and then use ReLU in the output layer for the regression itself.

My CNN predicts the labels rather accurately, but I cannot find any publication that uses softmax in hidden layers followed by ReLU in the output layer, nor any explanation of why this approach is not recommended (e.g. whether it is mathematically unsound), other than answers on Quora/Stack Overflow. I would really appreciate it if anyone could enlighten me on this matter, as I am pretty new to deep learning and wish to learn from this. Thank you in advance!
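
For concreteness, here is a minimal Keras sketch of the kind of architecture described above. All input shapes, filter counts, and layer sizes are illustrative assumptions, not the exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

SPECTRUM_SHAPE = (256, 256, 1)  # assumed shape of the 2D measured spectrum
LABEL_LEN = 100                 # assumed length of the 1D strain label

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=SPECTRUM_SHAPE))

# 5 convolution + 5 pooling layers alternating (10 layers total), all ReLU-activated
for filters in (16, 32, 64, 64, 64):
    model.add(layers.Conv2D(filters, 3, padding='same', activation='relu'))
    model.add(layers.MaxPooling2D(2))

model.add(layers.Flatten())

# 3 fully-connected hidden layers with softmax activation (the unusual choice)
for units in (256, 128, 128):
    model.add(layers.Dense(units, activation='softmax'))

# ReLU output layer so the predicted strain profile is non-negative
model.add(layers.Dense(LABEL_LEN, activation='relu'))

model.compile(optimizer='adam', loss='mse')
model.summary()
```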


Solution

  • If you look at the way a layer l sees the input from the previous layer l-1, it assumes that the dimensions of the feature vector are linearly independent.

    If the model is building some kind of confidence score using a set of neurons, then those neurons had better be linearly independent; otherwise it is simply exaggerating the value of a single neuron.

    If you apply softmax in the hidden layers, you are essentially coupling the neurons and tampering with their independence. Also, one of the main reasons ReLU is preferred is that it gives you better gradients than saturating activations such as sigmoid. Finally, if your goal is to add normalization to your layers, you'd be better off using an explicit batch normalization layer, as sketched below.
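
To illustrate that last point, here is a minimal sketch of a regression head that keeps ReLU in the hidden layers and uses an explicit BatchNormalization layer instead of softmax. The layer sizes and the flattened-feature dimension are assumptions, not values from the question:

```python
import tensorflow as tf
from tensorflow.keras import layers

LABEL_LEN = 100  # assumed length of the 1D strain label

head = tf.keras.Sequential()
head.add(tf.keras.Input(shape=(4096,)))        # assumed flattened conv features

for units in (256, 128, 128):
    head.add(layers.Dense(units))
    head.add(layers.BatchNormalization())      # explicit, learnable normalization
    head.add(layers.Activation('relu'))        # ReLU preserves useful gradients

# ReLU output keeps the predicted strain profile non-negative
head.add(layers.Dense(LABEL_LEN, activation='relu'))

head.compile(optimizer='adam', loss='mse')
head.summary()
```

This keeps the normalization step separate from the activation, so each hidden neuron can still vary independently instead of being forced to compete through a softmax.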