Tags: keras, classification, lstm, analysis

Get the probability of a word for text classification with LSTM in Keras


I'm doing sentiment classification using an LSTM with Keras, and I want to obtain the probability that the LSTM assigns to each word of a sentence so I can tell which words are most representative.

For example, for the following sentence:

"This landscape is wonderful and calming"

I consider the most representative words for classifying this sentence as positive to be "wonderful" and "calming".

How can I obtain the probability that LSTM assigns to each word?

from keras import layers, models

# input_layer feeds token ids into an embedding (both are referenced below)
input_layer = layers.Input(shape=(None,), dtype='int32')
embedding_layer = layers.Embedding(vocab_size, embedding_dim)(input_layer)

lstm_layer = layers.LSTM(size)(embedding_layer)

output_layer1 = layers.Dense(50, activation=activation)(lstm_layer)
output_layer1 = layers.Dropout(0.25)(output_layer1)
output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

model = models.Model(inputs=input_layer, outputs=output_layer2)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

Thanks


Solution

  • You can get the probabilities from the final layer (dense layer with softmax). Example model:

    import keras
    import keras.layers as L
    
    # instantiate sequential model
    model = keras.models.Sequential()
    
    # define input layer
    model.add(L.InputLayer([None], dtype='int32'))
    
    # define embedding layer for dictionary size of 'len(all_words)' and 50 features/units
    model.add(L.Embedding(len(all_words), 50))
    
    # define fully-connected RNN with 64 output units; crucially, return the RNN's
    # output for every time step instead of just the last time step
    model.add(L.SimpleRNN(64, return_sequences=True))
    
    # define dense layer of 'len(all_words)' outputs and softmax activation
    # this will produce a vector of size len(all_words)
    stepwise_dense = L.Dense(len(all_words), activation='softmax')
    
    # TimeDistributed applies the Dense layer to each time step (input word)
    # independently, rather than once to all of the time steps concatenated.
    # So, for a given time step, element 'i' of the output vector is the
    # probability of the ith word from the target dictionary.
    stepwise_dense = L.TimeDistributed(stepwise_dense)
    model.add(stepwise_dense)
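
    Before compiling, a quick sanity check confirms the model emits one probability vector (of size len(all_words)) per time step:

    # output shape: (batch, time, dictionary size)
    print(model.output_shape)  # (None, None, len(all_words))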
    

    Then, compile and fit (train) your model:

    model.compile('adam', 'categorical_crossentropy')
    
    model.fit_generator(generate_batches(train_data), len(train_data) // BATCH_SIZE,
                        callbacks=[EvaluateAccuracy()], epochs=5)
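
    'generate_batches' and 'EvaluateAccuracy' are helpers from the original training setup and aren't shown here. Purely as an illustration, such a generator could look like the sketch below (the next-word targets, the zero-padding, and the 'all_words' vocabulary are assumptions, not part of the answer):

    import numpy as np
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical

    def generate_batches(data, batch_size=BATCH_SIZE):
        # yield (inputs, one-hot targets) forever, as fit_generator expects
        while True:
            np.random.shuffle(data)
            for start in range(0, len(data) - batch_size + 1, batch_size):
                batch = data[start:start + batch_size]
                x = pad_sequences([s[:-1] for s in batch])  # all but the last token
                y = pad_sequences([s[1:] for s in batch])   # next token at each step
                # note: padding id 0 is treated as a real word here; masking omitted
                yield x, to_categorical(y, num_classes=len(all_words))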
    

    Finally, just use the predict function to get the probabilities:

    model.predict(input_to_your_network)
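
    Tying this back to the example sentence, the probability assigned to each word can be read off the per-step softmax like so (a sketch; the 'word_to_id' lookup is an assumed placeholder for your tokenizer):

    import numpy as np

    sentence = "this landscape is wonderful and calming".split()
    token_ids = np.array([[word_to_id[w] for w in sentence]])  # shape (1, 6)

    probs = model.predict(token_ids)[0]  # shape (6, len(all_words))

    # probability the softmax assigns to each word at its own position
    for t, word in enumerate(sentence):
        print(word, probs[t, word_to_id[word]])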
    

    And just to be clear, the ith output unit of the softmax layer represents the predicted probability of the ith class.
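
    For instance, the most probable dictionary word at each time step is simply the argmax over that last axis (reusing 'probs' from the sketch above):

    predicted_ids = probs.argmax(axis=-1)  # most probable word index at each step
    print([all_words[i] for i in predicted_ids])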