I am trying to use multiple GPUs with Keras. When I run the example program given as comments inside the training_utils.py code, I end up with a ResourceExhaustedError. nvidia-smi tells me that barely one of the four GPUs is working. Using a single GPU works fine for other programs.
Question: Does anyone have any idea what's going on here?
Console output:
(...)
2017-10-26 14:39:02.086838: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ***************************************************************************************************x
2017-10-26 14:39:02.086857: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[128,55,55,256]
Traceback (most recent call last):
  File "test.py", line 27, in <module>
    parallel_model.fit(x, y, epochs=20, batch_size=256)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1631, in fit
    validation_steps=validation_steps)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/training.py", line 1213, in _fit_loop
    outs = f(ins_batch)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2331, in __call__
    **self.session_kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[128,55,55,256]
     [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
     [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]

Caused by op u'replica_1/xception/block3_sepconv2/separable_conv2d', defined at:
  File "test.py", line 19, in <module>
    parallel_model = multi_gpu_model(model, gpus=2)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 143, in multi_gpu_model
    outputs = model(inputs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 603, in __call__
    output = self.call(inputs, **kwargs)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2061, in call
    output_tensors, _, _ = self.run_internal_graph(inputs, masks)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/engine/topology.py", line 2212, in run_internal_graph
    output_tensors = _to_list(layer.call(computed_tensor, **kwargs))
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 1221, in call
    dilation_rate=self.dilation_rate)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 3279, in separable_conv2d
    data_format=tf_data_format)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/nn_impl.py", line 497, in separable_conv2d
    name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 397, in conv2d
    data_format=data_format, name=name)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/kyb/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[128,55,55,256]
     [[Node: replica_1/xception/block3_sepconv2/separable_conv2d = Conv2D[T=DT_FLOAT, data_format="NHWC", padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/gpu:1"](replica_1/xception/block3_sepconv2/separable_conv2d/depthwise, block3_sepconv2/pointwise_kernel/read/_2103)]]
     [[Node: training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter/_4511 = _Recv client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_25380_training/RMSprop/gradients/replica_0/xception/block10_sepconv2/separable_conv2d_grad/Conv2DBackpropFilter", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]]
EDIT (Added example code):
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np

num_samples = 1000
height = 224
width = 224
num_classes = 100

# Instantiate the base model (Xception) on the CPU.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicate the model on 4 GPUs; each batch is split across the replicas.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))

# This `fit` call is distributed over the 4 GPUs.
parallel_model.fit(x, y, epochs=20, batch_size=128)
When you run into an OOM/ResourceExhaustedError on the GPU, I believe reducing the batch size is the first thing to try.
Different GPUs may need different batch sizes, depending on how much GPU memory is available.
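As a minimal sketch (reusing parallel_model, x and y from the example above, with gpus=4; the value 32 is just an illustrative choice), only the fit() call needs to change:

    # batch_size is the global batch: multi_gpu_model splits it across the replicas,
    # so batch_size=32 with 4 GPUs means each GPU processes 8 samples per step.
    parallel_model.fit(x, y, epochs=20, batch_size=32)

Start with a small value and increase it until just before the OOM reappears.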
I recently faced a similar kind of problem and experimented quite a bit to work around it.
Here is the link to the question (some tricks are included there as well).
However, keep in mind that reducing the batch size may make your training slower.