I have a supervised learning task f(X)=y where X is a 2-dimensional np.array of np.int8 and y is a 1-dimensional array of np.float64 containing probabilities (so numbers between 0 and 1). I want to build a neural network model that performs regression in order to predict said probabilities y given X.
As the output of my Network is one real value (i.e. the output layer has one neuron) and is a probability (so in the range [0, 1]), I believe I should use softmax as the activation function of the output layer (i.e. output neuron) in order to squash the network's output to [0, 1].
As it is a regression task, I opted for using the mean_squared_error loss (instead of cross_entropy_loss that is typically used in classification tasks and often paired with softmax).
However, as I am trying to fit(X, y) the loss does not change at all between epochs and remains constant. Any ideas why? Is the combination of softmax and mean_squared_error loss wrong for some reason and why?
If I remove the softmax it does work, but then my model could also predict values outside [0, 1], which I do not want. Yes, I could squash the output myself afterwards, but that doesn't seem right.
My code basically is (after removing some irrelevant additional callbacks for EarlyStopping and learning rate scheduling):
model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
model.add(Dense(1, activation='softmax'))
# compile model
model.compile(optimizer=Adam(), loss='mse') # mse is the standard loss for regression
# fit
model.fit(X, y, batch_size=batch_size, epochs=MAX_EPOCHS)
Edit: Turns out I needed the sigmoid function to squash one real value to [0, 1] as the accepted answer suggests. The softmax function for a vector of size 1 is always 1.
As you stated, you want to perform a regression task (which means finding a continuous mapping between your input and the desired output).
The softmax function creates a pseudo-probability distribution for multi-dimensional outputs (all values sum up to 1). This is why the softmax function is a good fit for classification tasks (predicting probabilities for different classes).
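As you want to perform a regression task and your output is one-dimensional, softmax will not work properly because it is always 1 for a one-dimensional input: exp(z) / exp(z) = 1 for any z. The output is therefore constant, its gradient with respect to the weights is zero, and the loss cannot change between epochs.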
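To make this concrete, here is a small NumPy sketch (not the Keras internals, just the textbook definitions) that applies softmax and sigmoid to the pre-activation of a single output neuron. The logits values are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations of ONE output neuron for five samples: shape (5, 1)
logits = np.array([[-3.0], [-0.5], [0.0], [2.0], [10.0]])

# Each row has a single element, so softmax normalizes it against itself
softmax_out = softmax(logits)
# Sigmoid depends on the actual value of the logit
sigmoid_out = sigmoid(logits)

print("softmax:", softmax_out.ravel())
print("sigmoid:", sigmoid_out.ravel())
```

Whatever the logits are, the softmax column is constant, so an MSE loss computed on it cannot react to weight updates, while the sigmoid column follows the logits.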
A function which maps a one-dimensional input continuously to [0,1] works fine here (e.g. the sigmoid).
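As a rough illustration, the following NumPy sketch trains a tiny network (one ReLU hidden layer, one sigmoid output neuron) with MSE using plain gradient descent. The toy data, the input scaling, the layer sizes and the learning rate are all assumptions made up for this example and are not taken from the question. In the Keras code from the question, the corresponding change would be activation='sigmoid' on the last Dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 200 samples with 4 int8 features, targets are probabilities
X = rng.integers(-5, 6, size=(200, 4)).astype(np.int8)
Xs = X.astype(np.float64) / 5.0            # assumption: scale inputs to [-1, 1]
true_w = np.array([2.0, -1.5, 1.0, 0.5])   # made-up ground-truth weights
y = sigmoid(Xs @ true_w)                   # float64 values in (0, 1)

# One hidden ReLU layer (8 units) and a single sigmoid output neuron
W1 = rng.normal(0.0, 0.5, size=(4, 8))
b1 = np.zeros(8)
W2 = rng.normal(0.0, 0.5, size=(8, 1))
b2 = np.zeros(1)

lr = 0.5
losses = []
for epoch in range(500):
    # Forward pass
    h_pre = Xs @ W1 + b1
    h = np.maximum(h_pre, 0.0)
    out = sigmoid(h @ W2 + b2).ravel()

    # Mean squared error
    err = out - y
    losses.append(np.mean(err ** 2))

    # Backward pass: d(MSE)/d(output logit) goes through the sigmoid derivative
    d_out = (2.0 / len(y)) * err * out * (1.0 - out)
    grad_W2 = h.T @ d_out[:, None]
    grad_b2 = d_out.sum()
    d_h = (d_out[:, None] @ W2.T) * (h_pre > 0)
    grad_W1 = Xs.T @ d_h
    grad_b1 = d_h.sum(axis=0)

    # Gradient descent step
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2

print("first loss:", losses[0])
print("last loss: ", losses[-1])
```

Because the sigmoid derivative out * (1 - out) is nonzero, the gradients are nonzero and the loss can move, while the predictions stay inside (0, 1) by construction.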
Note that you can interpret the output of both the sigmoid and the softmax function as probabilities. But be careful: these are only pseudo-probabilities and do not represent the certainty or uncertainty of your model in making predictions.