I'm using Keras for a multi-label classification problem on text comments, this one to be precise: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
There are six classes, and an observation can fall into any or all of them. I have trained my model (an LSTM) using binary cross-entropy as the loss function:
model.compile(loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = ['accuracy'])
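For context, the model looks roughly like this (the vocabulary size, sequence length and layer sizes below are just placeholders, not my exact setup):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 20000  # placeholder vocabulary size
max_len = 200       # placeholder length of the padded sequences

model = Sequential([
    Embedding(vocab_size, 128),      # word embeddings
    LSTM(64),                        # sequence encoder
    Dense(6, activation='sigmoid'),  # one independent score per class
])

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])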
Now, for the report I am writing, I would like to make some specific predictions, so I use predict() to get classifications:
y_pred = model.predict(padded_test, verbose=1)
"padded_test" here is a preprocessed test dataset. The problem is that when I call this method, then for this comment:
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
I get some really strange prediction values:
array([7.9924166e-03, 2.0393365e-05, 1.5081263e-03, 2.9950817e-05,
       1.9759631e-03, 2.7330496e-04], dtype=float32)

Here I can see that many of the class prediction values have exponents and look ridiculously high. Why is this, and how do I interpret these numbers?

Previously I tried categorical cross-entropy, which gave me only values between 0 and 1, which is what I am looking for; however, it messed up the predictions entirely.
The prediction values you are seeing are not large numbers. On the contrary, they are written in scientific notation with negative exponents, so they are very small: for example, the first one, 7.9924166e-03, is equal to 0.0079924166.
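If it helps to sanity-check the notation, here is a quick check in plain Python (no Keras involved); e-03 simply means multiplied by 10**-3:

# 7.9924166e-03 is scientific notation for 7.9924166 * 10**-3
value = 7.9924166e-03
print(value)                   # 0.0079924166
print(value == 0.0079924166)   # True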
To interpret these values correctly, you need to know which activation function the model uses in its output layer. For example:

- With a softmax activation, the outputs represent the probability of the input sample belonging to each of the classes, and they sum to 1.
- With a sigmoid activation (as appears to be the case here), the outputs are values between 0 and 1 that are independent of one another, as sketched below.
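To make the difference concrete, here is a minimal sketch of how each kind of output is usually turned back into labels. The six label names and the 0.5 threshold are my assumptions about your setup, not something taken from your code:

import numpy as np

# Assumed label order from the Jigsaw competition
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# The prediction vector from the question
y_pred = np.array([7.9924166e-03, 2.0393365e-05, 1.5081263e-03,
                   2.9950817e-05, 1.9759631e-03, 2.7330496e-04])

# Sigmoid outputs: each value is an independent probability, so each class is
# thresholded on its own (multi-label). Here every value is far below 0.5,
# so the model predicts none of the toxic classes for this comment.
predicted = [name for name, p in zip(labels, y_pred) if p >= 0.5]
print(predicted)  # []

# Softmax outputs (mutually exclusive classes) would instead be read with
# argmax, picking the single most probable class:
# print(labels[np.argmax(y_pred)])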