Tags: keras, regression, probability, softmax

Why can't I use softmax in regression task for probabilities?


I have a supervised learning task f(X)=y where X is a 2-dimensional np.array of np.int8 and y is a 1-dimensional array of np.float64 containing probabilities (numbers between 0 and 1). I want to build a neural network model that performs regression in order to predict these probabilities y given X.

As the output of my network is a single real value (i.e. the output layer has one neuron) and is a probability (so in the range [0, 1]), I believe I should use softmax as the activation function of the output layer (i.e. the output neuron) in order to squash the network's output to [0, 1].

As it is a regression task, I opted to use the mean_squared_error loss (instead of cross_entropy_loss, which is typically used in classification tasks and often paired with softmax).

However, when I try to fit(X, y), the loss does not change at all between epochs and remains constant. Any ideas why? Is the combination of softmax and the mean_squared_error loss wrong for some reason, and if so, why?

If I remove the softmax it does work, but then my model would also predict values outside [0, 1], which I do not want. Yes, I could squash the output myself afterwards, but that doesn't seem right.

My code is basically the following (after removing some irrelevant additional callbacks for EarlyStopping and learning rate scheduling):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
model.add(Dense(1, activation='softmax'))  # single output neuron
# compile model
model.compile(optimizer=Adam(), loss='mse')   # mse is the standard loss for regression
# fit
model.fit(X, y, batch_size=batch_size, epochs=MAX_EPOCHS)

Edit: It turns out I needed the sigmoid function to squash a single real value to [0, 1], as the accepted answer suggests. The softmax of a vector of size 1 is always 1.
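
To see why the loss stays constant, here is a minimal numpy sketch; the softmax and sigmoid helpers below are written out for illustration and are not part of the original code:

import numpy as np

def softmax(z):
    # Numerically stable softmax: exponentiate, then normalize so values sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For a length-1 vector, softmax(z) = e^z / e^z = 1 no matter what z is,
# so the network's output (and hence the MSE loss) can never change.
print(softmax(np.array([-3.7])))   # [1.]
print(softmax(np.array([42.0])))   # [1.]

# Sigmoid, in contrast, maps a single real value continuously into (0, 1).
print(sigmoid(-3.7))               # ~0.024
print(sigmoid(42.0))               # ~1.0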


Solution

  • As you stated, you want to perform a regression task (which means finding a continuous mapping between your input and desired output). The softmax function creates a pseudo-probability distribution over multi-dimensional outputs (all values sum up to 1). This is the reason why softmax is a perfect fit for classification tasks (predicting probabilities for different classes).

    As you want to perform a regression task and your output is one-dimensional, softmax does not work properly because it is always 1 for a one-dimensional input. A function which maps a one-dimensional input continuously to [0, 1] works fine here (e.g. sigmoid); see the sketch after this answer.

    Note that you can also interpret both the sigmoid and the softmax outputs as probabilities. But be careful: these are only pseudo-probabilities, and they do not represent the certainty or uncertainty of your model in making its predictions.
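
For reference, here is a minimal sketch of the corrected model with a sigmoid output, reusing the names from the question (W1_size, input_dims, X, y, batch_size, and MAX_EPOCHS are assumed to be defined as above):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(W1_size, input_shape=(input_dims,), activation='relu'))
model.add(Dense(1, activation='sigmoid'))  # sigmoid squashes the single output into (0, 1)
model.compile(optimizer=Adam(), loss='mse')
model.fit(X, y, batch_size=batch_size, epochs=MAX_EPOCHS)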