tensorflow · lstm · language-model

Get the probability distribution of the next word given a sequence, using TensorFlow's RNN (LSTM) language model?


I'm running TensorFlow's RNN (LSTM) language model example here. It runs and reports the perplexities perfectly.

What I want, though, is three things:

  1. Given a sequence (e.g. w1 w5 w2000 w750), give me the probability distribution for the next word over the vocabulary. I don't know how to do this with the model in the tutorial.

  2. I want the model to return a ranking of the most probable sequences (e.g. n-grams); n can be given as input.

  3. Given a sequence, I want its probability.

I'm new to TensorFlow and RNNs, so please tell me if you need more information than I have provided.

The code for the language model is here.


Solution

  • I'm new to TensorFlow and RNNs too, so here's my thinking about your questions.
    Assuming you have a corpus of 2000 words (quite small), the softmaxed output at the i-th LSTM step is a vector with 2000 elements, each corresponding to a probability, and this vector is the predicted probability distribution for the (i+1)-th word.
    Back to your questions:

    1. You just need to feed the input [w1, w5, w2000, w750] to the RNN. You get four output vectors, each with 2000 elements (the vocabulary size); pick the last one, and that is the predicted probability distribution for the 5th word. You can also take an argmax over this vector to find the most probable word for the 5th position.

    2. I have no idea about this one, even though I can assign a probability to any given sequence.

    3. Again considering your input [w1, w5, w2000, w750]: after running the RNN you have four output vectors, denoted [v1, v2, v3, v4]. You just need to find the probability of w5 in v1, of w2000 in v2, and of w750 in v3, and multiply these probabilities; that is the probability of your input (v4 is not used because it predicts the word that follows the sequence, and w1 is not used because it is usually the starting token). See the sketch right after this list.
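
    Here is a minimal sketch of that chain-rule computation for question 3. It assumes you have already evaluated the per-step probability distributions as numpy arrays (for example from the inference function in the edit below); the function name and variable names are just illustrative, not part of the tutorial code.

    python

    import numpy as np

    def sequence_probability(step_distributions, word_ids):
        """
        step_distributions: list of numpy arrays; step_distributions[i] is the
            model's distribution over the vocabulary after reading word_ids[i]
        word_ids: the input word ids, e.g. [w1, w5, w2000, w750]
        """
        prob = 1.0
        # the distribution at step i predicts the word at position i + 1,
        # so pair step_distributions[i] with word_ids[i + 1]
        for i in range(len(word_ids) - 1):
            dist = np.ravel(step_distributions[i])  # flatten a [1, vocab_size] batch if needed
            prob *= dist[word_ids[i + 1]]
        return prob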

    Edit:

    Once you have trained your model, you should have an embedding matrix embedding, an RNN cell cell, and softmax weights/biases softmax_w / softmax_b; you can generate outputs using these three things.

    python

    def inference(inputs):
        """
        inputs: a list containing a sequence of word ids
        """
        outputs = []
        state = cell.zero_state(1, tf.float32)  # batch size 1: only one sequence
        # wrap inputs in a batch dimension so embed has shape [1, sequence_length, embedding_size]
        embed = tf.nn.embedding_lookup(embedding, [inputs])
        sequence_length = len(inputs)
        for i in range(sequence_length):
            cell_output, state = cell(embed[:, i, :], state)
            logits = tf.nn.xw_plus_b(cell_output, softmax_w, softmax_b)
            probability = tf.nn.softmax(logits)  # distribution over the vocabulary
            outputs.append(probability)
        return outputs
    

    The final output is a list containing len(inputs) vectors/tensors; you can use sess.run(tensor) to get the value of a tensor as a numpy.array.
    This is just a simple function I wrote; it should give you a general idea of how to generate outputs once you have finished training.
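
    As a rough usage sketch (the word ids are made up, and sess is assumed to be a tf.Session in which the trained variables embedding, cell, softmax_w and softmax_b already live), you could answer questions 1 and 3 like this:

    python

    input_ids = [1, 5, 2000, 750]           # e.g. w1 w5 w2000 w750
    output_tensors = inference(input_ids)   # one probability tensor per step

    # question 1: distribution over the vocabulary for the next (5th) word
    next_word_dist = sess.run(output_tensors[-1])[0]  # shape: [vocab_size]
    most_probable_next = next_word_dist.argmax()

    # question 3: probability of the sequence itself, via the chain rule
    step_dists = sess.run(output_tensors)
    print(sequence_probability(step_dists, input_ids))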