Summary and Test Cases
The core issue is that Tensorflow throws OOM allocations on a batch that is not the first, as I would expect. Therefore, I believe there is a memory leak since all memory is clearly not being freed after each batch.
num_units: 50, batch_size: 1000; fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 800, fails OOM (gpu) before 1st batch as expected
num_units: 50, batch_size: 750; fails OOM (gpu) after 10th batch (???)
num_units: 50, batch_size: 500; fails OOM (gpu) after 90th batch (???)
num_units: 50, batch_size: 300; fails OOM (gpu) after 540th batch (???)
num_units: 50, batch_size: 200; computer freezes after around 900 batches with 100% ram use
num_units: 50, batch_size: 100; passes 1 epoch -- may fail later (unknown)
Explanation:
Essentially, it runs 144
batch with a batch size of 500
before failing on the 145th batch, which seems strange. If it can't allocate enough memory for the 145th batch, why should it work for the first 144? The behavior can be replicated.
Note that each batch DOES vary in size, since each one has dimensions [BATCH_SIZE, MAX_SEQUENCE_LENGTH]
, and depending on the sequences sampled, the sequence length varies, but the program does not fail on the largest batch; it fails later on a smaller one. Therefore, I conclude that a single oversized batch is not causing the memory error; it appears to be a memory leak.
With a larger batch size, the program fails earlier; with a smaller batch size, it fails later.
The full error is here:
Traceback (most recent call last):
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/me/IdeaProjects/tf-nmt/main.py", line 89, in <module>
_ = sess.run([update_step])
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
Caused by op 'decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul', defined at:
File "/home/me/IdeaProjects/tf-nmt/main.py", line 49, in <module>
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 309, in dynamic_decode
swap_memory=swap_memory)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2819, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2643, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2593, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 254, in body
decoder_finished) = decoder.step(time, inputs, state)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py", line 138, in step
cell_outputs, cell_state = self._cell(inputs, state)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 290, in __call__
return base_layer.Layer.__call__(self, inputs, state, scope=scope)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 618, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 567, in call
array_ops.concat([inputs, h], 1), self._kernel)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1993, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2532, in _mat_mul
name=name)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3081, in create_op
op_def=op_def)
File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1528, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[500,80]
[[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
[[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]
Code snippet (from models.py)
import tensorflow as tf
from tensorflow.python.layers import core as layers_core
class NMTModel:
def __init__(self, hparams, iterator, mode):
source, target_in, target_out, source_lengths, target_lengths = iterator.get_next()
true_batch_size = tf.size(source_lengths)
# Lookup embeddings
embedding_encoder = tf.get_variable("embedding_encoder", [hparams.src_vsize, hparams.src_emsize])
encoder_emb_inp = tf.nn.embedding_lookup(embedding_encoder, source)
embedding_decoder = tf.get_variable("embedding_decoder", [hparams.tgt_vsize, hparams.tgt_emsize])
decoder_emb_inp = tf.nn.embedding_lookup(embedding_decoder, target_in)
# Build and run Encoder LSTM
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(encoder_cell, encoder_emb_inp, sequence_length=source_lengths, dtype=tf.float32)
# Build and run Decoder LSTM with Helper and output projection layer
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
projection_layer = layers_core.Dense(hparams.tgt_vsize, use_bias=False)
# if mode is 'TRAIN' or mode is 'EVAL': # then decode using TrainingHelper
# helper = tf.contrib.seq2seq.TrainingHelper(decoder_emb_inp, sequence_length=target_lengths)
# elif mode is 'INFER': # then decode using Beam Search
# helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection_layer)
outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=tf.reduce_max(target_lengths))
logits = outputs.rnn_output
if mode is 'TRAIN' or mode is 'EVAL': # then calculate loss
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_out, logits=logits)
target_weights = tf.sequence_mask(target_lengths, maxlen=tf.shape(target_out)[1], dtype=logits.dtype)
self.loss = tf.reduce_sum((crossent * target_weights)) / tf.cast(true_batch_size, tf.float32)
if mode is 'TRAIN': # then calculate/clip gradients, then optimize model
params = tf.trainable_variables()
gradients = tf.gradients(self.loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(gradients, hparams.max_gradient_norm)
optimizer = tf.train.AdamOptimizer(hparams.l_rate)
self.update_step = optimizer.apply_gradients(zip(clipped_gradients, params))
if mode is 'EVAL' or mode is 'INFER': # then allow access to input/output tensors to printout
self.src = source
self.tgt = target_out
self.preds = tf.argmax(logits, axis=2)
# Designate a saver operation
self.saver = tf.train.Saver(tf.global_variables())
def train(self, sess):
return sess.run([self.update_step, self.loss])
def eval(self, sess):
return sess.run([self.loss, self.src, self.tgt, self.preds])
def infer(self, sess):
return sess.run([self.src, self.tgt, self.preds]) # tgt should not exist (temporary debugging only)
Batches have variable length, so smaller batches may pass without OOM while larger ones may not.
Depending on your implementation, you can print out the batches lengths (the largest length of the batch such that all other sequences are padded up to that length) and determine whether this is causing your issue.
To fix this, lower your batch size, or set a max length for your iterator.
This is not a memory leak.