TensorFlow ResourceExhaustedError after first batch

Summary and Test Cases

The core issue is that TensorFlow raises an OOM error on a batch other than the first, which is where I would expect it to fail. Since memory does not appear to be fully freed after each batch, I suspect a memory leak.

num_units: 50, batch_size: 1000; fails OOM (GPU) before 1st batch as expected
num_units: 50, batch_size: 800; fails OOM (GPU) before 1st batch as expected
num_units: 50, batch_size: 750; fails OOM (GPU) after 10th batch (???)
num_units: 50, batch_size: 500; fails OOM (GPU) after 90th batch (???)
num_units: 50, batch_size: 300; fails OOM (GPU) after 540th batch (???)
num_units: 50, batch_size: 200; computer freezes after around 900 batches with 100% RAM use
num_units: 50, batch_size: 100; passes 1 epoch -- may fail later (unknown)

Explanation:

Essentially, it runs 144 batches with a batch size of 500 before failing on the 145th batch, which seems strange. If it can't allocate enough memory for the 145th batch, why should it work for the first 144? The behavior can be replicated.

Note that each batch DOES vary in size: each one has dimensions [BATCH_SIZE, MAX_SEQUENCE_LENGTH], and the sequence length depends on the sequences sampled. However, the program does not fail on the largest batch; it fails later on a smaller one. I therefore conclude that a single oversized batch is not causing the memory error, and that it appears to be a memory leak.

With a larger batch size, the program fails earlier; with a smaller batch size, it fails later.

The full error is here:

Traceback (most recent call last):
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
     [[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
     [[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/me/IdeaProjects/tf-nmt/main.py", line 89, in <module>
    _ = sess.run([update_step])
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[500,80]
     [[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
     [[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]

Caused by op 'decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul', defined at:
  File "/home/me/IdeaProjects/tf-nmt/main.py", line 49, in <module>
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 309, in dynamic_decode
    swap_memory=swap_memory)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2819, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2643, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2593, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/decoder.py", line 254, in body
    decoder_finished) = decoder.step(time, inputs, state)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/contrib/seq2seq/python/ops/basic_decoder.py", line 138, in step
    cell_outputs, cell_state = self._cell(inputs, state)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 290, in __call__
    return base_layer.Layer.__call__(self, inputs, state, scope=scope)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/layers/base.py", line 618, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 567, in call
    array_ops.concat([inputs, h], 1), self._kernel)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 1993, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2532, in _mat_mul
    name=name)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3081, in create_op
    op_def=op_def)
  File "/home/me/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1528, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[500,80]
     [[Node: decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/while/BasicDecoderStep/basic_lstm_cell/concat, decoder/while/BasicDecoderStep/basic_lstm_cell/MatMul/Enter)]]
     [[Node: gradients/Add/_282 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_457_gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopdecoder/while/BasicDecoderStep/TrainingHelperNextInputs/add/y/_181)]]

Code snippet (from models.py)

import tensorflow as tf
from tensorflow.python.layers import core as layers_core


class NMTModel:
    def __init__(self, hparams, iterator, mode):
        source, target_in, target_out, source_lengths, target_lengths = iterator.get_next()
        true_batch_size = tf.size(source_lengths)

        # Lookup embeddings
        embedding_encoder = tf.get_variable("embedding_encoder", [hparams.src_vsize, hparams.src_emsize])
        encoder_emb_inp = tf.nn.embedding_lookup(embedding_encoder, source)
        embedding_decoder = tf.get_variable("embedding_decoder", [hparams.tgt_vsize, hparams.tgt_emsize])
        decoder_emb_inp = tf.nn.embedding_lookup(embedding_decoder, target_in)

        # Build and run Encoder LSTM
        encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
        encoder_outputs, encoder_state = tf.nn.dynamic_rnn(encoder_cell, encoder_emb_inp, sequence_length=source_lengths, dtype=tf.float32)

        # Build and run Decoder LSTM with Helper and output projection layer
        decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(hparams.num_units)
        projection_layer = layers_core.Dense(hparams.tgt_vsize, use_bias=False)
        # if mode in ('TRAIN', 'EVAL'):  # then decode using TrainingHelper
        #     helper = tf.contrib.seq2seq.TrainingHelper(decoder_emb_inp, sequence_length=target_lengths)
        # elif mode == 'INFER':  # then decode greedily
        #     helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
        helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(embedding_decoder, tf.fill([true_batch_size], hparams.sos), hparams.eos)
        decoder = tf.contrib.seq2seq.BasicDecoder(decoder_cell, helper, encoder_state, output_layer=projection_layer)
        outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=tf.reduce_max(target_lengths))
        logits = outputs.rnn_output

        if mode in ('TRAIN', 'EVAL'):  # then calculate loss (compare strings by value, not identity)
            crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_out, logits=logits)
            target_weights = tf.sequence_mask(target_lengths, maxlen=tf.shape(target_out)[1], dtype=logits.dtype)
            self.loss = tf.reduce_sum((crossent * target_weights)) / tf.cast(true_batch_size, tf.float32)

        if mode == 'TRAIN':  # then calculate/clip gradients, then optimize model
            params = tf.trainable_variables()
            gradients = tf.gradients(self.loss, params)
            clipped_gradients, _ = tf.clip_by_global_norm(gradients, hparams.max_gradient_norm)

            optimizer = tf.train.AdamOptimizer(hparams.l_rate)
            self.update_step = optimizer.apply_gradients(zip(clipped_gradients, params))

        if mode in ('EVAL', 'INFER'):  # then allow access to input/output tensors to printout
            self.src = source
            self.tgt = target_out
            self.preds = tf.argmax(logits, axis=2)

        # Designate a saver operation
        self.saver = tf.train.Saver(tf.global_variables())

    def train(self, sess):
        return sess.run([self.update_step, self.loss])

    def eval(self, sess):
        return sess.run([self.loss, self.src, self.tgt, self.preds])

    def infer(self, sess):
        return sess.run([self.src, self.tgt, self.preds])  # tgt should not exist (temporary debugging only)

Solution

Batches vary in padded sequence length, so a batch of shorter sequences may fit in memory while a batch of longer sequences triggers OOM.
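
As a rough illustration of why the padded length matters, here is a minimal sketch that estimates the size of a single float32 tensor shaped like the logits in the question's model, [batch_size, max_len, tgt_vsize]. The vocabulary size and sequence lengths below are hypothetical, and real usage is higher because activations and gradients are stored as well.

```python
def logits_bytes(batch_size, max_len, vocab_size, bytes_per_value=4):
    """Bytes for one float32 tensor of shape [batch_size, max_len, vocab_size]."""
    return batch_size * max_len * vocab_size * bytes_per_value


VOCAB_SIZE = 10000  # hypothetical target vocabulary size, not from the question

# Same batch size, but the second batch is padded to twice the length.
shorter = logits_bytes(500, 50, VOCAB_SIZE)
longer = logits_bytes(500, 100, VOCAB_SIZE)
print(shorter / 1e9, "GB vs", longer / 1e9, "GB")
```

With a fixed batch size, this one tensor grows linearly with the padded length, so a single batch with an unusually long sequence can exceed what earlier batches needed.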

Depending on your implementation, you can print out each batch's length (the length of its longest sequence, to which all other sequences are padded) and check whether this is causing your issue.
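
A minimal sketch of that check, assuming batches are available as plain Python lists of token-id sequences. The helper names and toy data are hypothetical and only meant to show the idea.

```python
def padded_length(batch):
    """Length every sequence in the batch is padded to: the longest sequence."""
    return max(len(seq) for seq in batch)


def log_batch_lengths(batches):
    """Print each batch's padded length and return the largest one seen."""
    longest = 0
    for step, batch in enumerate(batches, start=1):
        length = padded_length(batch)
        longest = max(longest, length)
        print("batch %d: size=%d padded_length=%d (max so far %d)"
              % (step, len(batch), length, longest))
    return longest


# Hypothetical toy batches of token-id sequences.
toy_batches = [
    [[1, 2, 3], [4, 5]],
    [[1, 2, 3, 4, 5, 6, 7], [8]],
    [[1, 2], [3, 4, 5, 6]],
]
longest = log_batch_lengths(toy_batches)
```

In the model from the question, the analogous quantities would be something like tf.reduce_max(source_lengths) and tf.reduce_max(target_lengths). They should be fetched in the same sess.run call as the update step, since a separate call would pull a different batch from the iterator.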

To fix this, lower your batch size or set a maximum sequence length for your iterator.
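
One way to cap the length is to drop over-long examples before batching. The sketch below does this on plain Python lists; MAX_LEN, the helper names, and the toy pairs are hypothetical.

```python
MAX_LEN = 50  # hypothetical cap; tune to what fits in GPU memory


def within_max_len(source, target, max_len=MAX_LEN):
    """True if both sides of a sentence pair fit within max_len tokens."""
    return len(source) <= max_len and len(target) <= max_len


def cap_lengths(pairs, max_len=MAX_LEN):
    """Keep only the (source, target) pairs that fit within max_len."""
    return [(s, t) for s, t in pairs if within_max_len(s, t, max_len)]


# Hypothetical (source, target) token-id pairs of varying lengths.
pairs = [
    (list(range(10)), list(range(12))),
    (list(range(80)), list(range(75))),  # dropped: both sides exceed MAX_LEN
    (list(range(50)), list(range(49))),
]
kept = cap_lengths(pairs)
print(len(kept), "of", len(pairs), "pairs kept")
```

If the input pipeline is built with tf.data, the same idea would typically be a Dataset.filter on the sequence lengths, applied before the batching step. Truncating long sequences instead of dropping them is an alternative, at the cost of losing the tail of those examples.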

This is not a memory leak.