Tags: keras, classification, lstm, analysis

Get the probability of a word for text classification with LSTM in Keras


I'm doing sentiment classification using an LSTM with Keras, and I want to obtain the probability that the LSTM assigns to each word of a sentence so I can tell which words are most representative.

For example, for the following sentence:

"This landscape is wonderful and calming"

I consider the most representative words for classifying this sentence as positive to be "wonderful" and "calming".

How can I obtain the probability that LSTM assigns to each word?

from keras import layers, models

# input_layer feeds token ids into an embedding (both are referenced below)
input_layer = layers.Input(shape=(None,), dtype='int32')
embedding_layer = layers.Embedding(vocab_size, embedding_dim)(input_layer)

lstm_layer = layers.LSTM(size)(embedding_layer)

output_layer1 = layers.Dense(50, activation=activation)(lstm_layer)
output_layer1 = layers.Dropout(0.25)(output_layer1)
output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

model = models.Model(inputs=input_layer, outputs=output_layer2)
model.compile(optimizer=optimizer, loss='binary_crossentropy')

Thanks


Solution

  • You can get the probabilities from the final layer (dense layer with softmax). Example model:

    import keras
    import keras.layers as L
    
    # instantiate sequential model
    model = keras.models.Sequential()
    
    # define input layer
    model.add(L.InputLayer([None], dtype='int32'))
    
    # define embedding layer for dictionary size of 'len(all_words)' and 50 features/units
    model.add(L.Embedding(len(all_words), 50))
    
    # define fully-connected RNN with 64 output units; crucially, return the RNN's
    # output for every time step instead of just the last time step
    model.add(L.SimpleRNN(64, return_sequences=True))
    
    # define dense layer of 'len(all_words)' outputs and softmax activation
    # this will produce a vector of size len(all_words)
    stepwise_dense = L.Dense(len(all_words), activation='softmax')
    
    # TimeDistributed applies the Dense layer to each time step (input word)
    # independently, rather than once to all of the time steps concatenated.
    # So, for a given time step, element 'i' of the output vector is the
    # probability of the ith word from the target dictionary.
    stepwise_dense = L.TimeDistributed(stepwise_dense)
    model.add(stepwise_dense)
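
    Before compiling, a quick sanity check confirms the model emits one probability vector (of size len(all_words)) per time step:

    # output shape: (batch, time, dictionary size)
    print(model.output_shape)  # (None, None, len(all_words))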
    

    Then, compile and fit (train) your model:

    model.compile('adam', 'categorical_crossentropy')
    
    model.fit_generator(generate_batches(train_data), len(train_data) // BATCH_SIZE,
                        callbacks=[EvaluateAccuracy()], epochs=5)
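
    'generate_batches' and 'EvaluateAccuracy' are helpers from the original training setup and aren't shown here. Purely as an illustration, such a generator could look like the sketch below (the next-word targets, the zero-padding, and the 'all_words' vocabulary are assumptions, not part of the answer):

    import numpy as np
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical

    def generate_batches(data, batch_size=BATCH_SIZE):
        # yield (inputs, one-hot targets) forever, as fit_generator expects
        while True:
            np.random.shuffle(data)
            for start in range(0, len(data) - batch_size + 1, batch_size):
                batch = data[start:start + batch_size]
                x = pad_sequences([s[:-1] for s in batch])  # all but the last token
                y = pad_sequences([s[1:] for s in batch])   # next token at each step
                # note: padding id 0 is treated as a real word here; masking omitted
                yield x, to_categorical(y, num_classes=len(all_words))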
    

    Finally, just use the predict function to get the probabilities:

    model.predict(input_to_your_network)
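
    Tying this back to the example sentence, the probability assigned to each word can be read off the per-step softmax like so (a sketch; the 'word_to_id' lookup is an assumed placeholder for your tokenizer):

    import numpy as np

    sentence = "this landscape is wonderful and calming".split()
    token_ids = np.array([[word_to_id[w] for w in sentence]])  # shape (1, 6)

    probs = model.predict(token_ids)[0]  # shape (6, len(all_words))

    # probability the softmax assigns to each word at its own position
    for t, word in enumerate(sentence):
        print(word, probs[t, word_to_id[word]])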
    

    And just to be clear, the ith output unit of the softmax layer represents the predicted probability of the ith class.
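
    For instance, the most probable dictionary word at each time step is simply the argmax over that last axis (reusing 'probs' from the sketch above):

    predicted_ids = probs.argmax(axis=-1)  # most probable word index at each step
    print([all_words[i] for i in predicted_ids])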