I have implemented a bidirectional RNN in TensorFlow using a BasicLSTMCell and rnn.bidirectional_rnn. I calculate the loss with seq2seq.sequence_loss_by_example after concatenating the outputs I receive. My application is a next-character predictor.
I am getting an extremely low cost (about 50 times lower than with the unidirectional RNN). I suspect I've made a mistake in the seq2seq.sequence_loss_by_example step.
Here is my model:
# Model begins
cell_fn = rnn_cell.BasicLSTMCell
cell = fw_cell = cell_fn(args.rnn_size, state_is_tuple=True)
cell2 = bw_cell = cell_fn(args.rnn_size, state_is_tuple=True)
input_data = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
targets = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
initial_state = fw_cell.zero_state(args.batch_size, tf.float32)
initial_state2 = bw_cell.zero_state(args.batch_size, tf.float32)
with tf.variable_scope('rnnlm'):
    softmax_w = tf.get_variable("softmax_w", [2*args.rnn_size, args.vocab_size])
    softmax_b = tf.get_variable("softmax_b", [args.vocab_size])
    with tf.device("/cpu:0"):
        embedding = tf.get_variable("embedding", [args.vocab_size, args.rnn_size])
        input_embeddings = tf.nn.embedding_lookup(embedding, input_data)
        inputs = tf.unpack(input_embeddings, axis=1)
outputs, last_state, last_state2 = rnn.bidirectional_rnn(fw_cell,
                                                         bw_cell,
                                                         inputs,
                                                         initial_state_fw=initial_state,
                                                         initial_state_bw=initial_state2,
                                                         dtype=tf.float32)
output = tf.reshape(tf.concat(1, outputs), [-1, 2*args.rnn_size])
logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)
loss = seq2seq.sequence_loss_by_example([logits],
                                        [tf.reshape(targets, [-1])],
                                        [tf.ones([args.batch_size * args.seq_length])],
                                        args.vocab_size)
cost = tf.reduce_sum(loss) / args.batch_size / args.seq_length
lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
                                  args.grad_clip)
optimizer = tf.train.AdamOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
I don't think there is any mistake in your code.
The problem is the objective function of the Bi-RNN model in your application (a next-character predictor).
A unidirectional RNN (such as ptb_word_lm or char-rnn-tensorflow) is a genuinely predictive model. For example, if raw_text is 1,3,5,2,4,8,9,0, then your inputs and target will be:
inputs: 1,3,5,2,4,8,9
target: 3,5,2,4,8,9,0
and the predictions are (1)->3, (1,3)->5, ..., (1,3,5,2,4,8,9)->0.
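The shifting described above can be sketched in plain Python. This is only an illustration of how the pairs line up, not code from either char-rnn-tensorflow or ptb_word_lm; the name prediction_pairs is made up for this example.

```python
raw_text = [1, 3, 5, 2, 4, 8, 9, 0]

# Inputs are every symbol except the last one;
# targets are the same sequence shifted left by one position.
inputs = raw_text[:-1]
targets = raw_text[1:]

# At step t a unidirectional model has read inputs[0..t]
# and is asked to predict targets[t].
prediction_pairs = [(tuple(inputs[:t + 1]), targets[t])
                    for t in range(len(inputs))]

for context, target in prediction_pairs:
    print(context, "->", target)
```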
But in a Bi-RNN, the first prediction is not just (1)->3, because output[0] in your code also contains information from the reversed raw_text via bw_cell (and likewise the later predictions are not just (1,3)->5, ..., (1,3,5,2,4,8,9)->0). By the time the backward cell reaches position 0 it has already read every later input, including the character you are asking the model to predict. An analogy: if I tell you that a flower is a rose and then ask you to predict what the flower is, you can give the right answer very easily. This is why you are getting an extremely low loss with your Bi-RNN model in this application.
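To make the leak concrete, here is a small plain-Python sketch that tracks which input positions each direction has read by step t. It does not run an RNN; visible_positions and leaky_steps are hypothetical names used only for this illustration, and it assumes the same inputs/targets shift as in the example above.

```python
raw_text = [1, 3, 5, 2, 4, 8, 9, 0]
inputs, targets = raw_text[:-1], raw_text[1:]


def visible_positions(t, seq_len):
    """Input positions that can influence the concatenated output at step t."""
    forward = set(range(0, t + 1))     # forward cell has read inputs[0..t]
    backward = set(range(t, seq_len))  # backward cell has read inputs[seq_len-1..t]
    return forward, backward


leaky_steps = []
for t in range(len(inputs)):
    forward, backward = visible_positions(t, len(inputs))
    # targets[t] is the same symbol as inputs[t + 1] whenever that position exists,
    # so the answer is leaked if position t + 1 is visible to either direction.
    if (t + 1) in forward or (t + 1) in backward:
        leaky_steps.append(t)

print("steps whose target is already visible:", leaky_steps)
```

In this sketch only the last step is free of the leak, because its target (the final 0) never appears in inputs.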
In fact, I think a Bi-RNN (or Bi-LSTM) is not an appropriate model for next-character prediction. A Bi-RNN needs the full sequence in order to work, so you will find that you can't easily use this model when you actually want to predict the next character, because the characters the backward cell would have to read do not exist yet.