Tags: python, tensorflow, deep-learning, distributed-computing, multi-gpu

Use multiple GPUs for inception_v3 model in TF slim


I am trying to train a slim model using 3 GPUs.

I am specifically telling TF to allocate the model on the second GPU:

with tf.device('device:GPU:1'):
    logits, end_points = inception_v3(inputs)

However, I'm getting an OOM error on that GPU every time I run my code. I've tried reducing the batch_size so the model fits in memory, but that ruins the network's training.

I own 3 GPUs, so is there a way to tell TF to use my third GPU when the second one is full? I've also tried not pinning TF to any GPU and allowing soft placement, but that isn't working either.
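
For reference, this is roughly the soft placement setup I tried (a TF 1.x session config):

import tensorflow as tf

# Let TF fall back to another device when the requested one is unavailable
config = tf.ConfigProto(allow_soft_placement=True)
sess = tf.Session(config=config)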


Solution

  • The statement with tf.device('device:GPU:1') tells TensorFlow to use GPU-1 specifically, so it won't attempt to place anything on any other device you have.

    When the model is too big, the recommended approach is model parallelism: manually splitting your graph across different GPUs. The complication in your case is that the model definition lives in the library, so you can't insert tf.device statements for different layers unless you patch TensorFlow.
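
    For context, manual model parallelism ordinarily means wrapping different layers in explicit tf.device blocks, which is exactly what the slim builder doesn't let you do from the outside. A minimal sketch with hypothetical plain layers (not the slim builder) would look like:

    import tensorflow as tf

    inputs = tf.placeholder(tf.float32, [None, 299, 299, 3])

    # Early layers on one GPU
    with tf.device('device:GPU:0'):
      net = tf.layers.conv2d(inputs, filters=32, kernel_size=3, activation=tf.nn.relu)

    # Remaining layers on another GPU; activations are copied across devices
    with tf.device('device:GPU:1'):
      net = tf.reduce_mean(net, axis=[1, 2])
      logits = tf.layers.dense(net, units=1000)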

    But there is a workaround.

    You can define and place the variables yourself before invoking the inception_v3 builder. This way inception_v3 will reuse those variables rather than changing their placement. Example:

    import tensorflow as tf
    # Assumes inception_v3 is imported from the TF-Slim nets library, as in the question.

    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
      # Pre-create the final logits layer's variables on GPU-1
      with tf.device('device:GPU:1'):
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/biases", shape=[1000])
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/weights", shape=[1, 1, 2048, 1000])

      # Build the model on GPU-0; thanks to AUTO_REUSE, the builder picks up
      # the variables above instead of creating new ones
      with tf.device('device:GPU:0'):
        logits, end_points = inception_v3(inputs)
    

    Upon running, you'll see that all variables except those of Conv2d_1c_1x1 are placed on GPU-0, while the Conv2d_1c_1x1 layer lives on GPU-1.
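
    One way to confirm this yourself (assuming a TF 1.x session) is to enable device placement logging when launching the graph:

    import tensorflow as tf

    # Logs the device chosen for every op and variable at graph launch
    config = tf.ConfigProto(log_device_placement=True)
    with tf.Session(config=config) as sess:
      sess.run(tf.global_variables_initializer())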

    The drawback is that you need to know the shape of each variable you want to place manually. But it is doable, and at least it can get your model running.
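
    One way to discover those names and shapes (a rough sketch, assuming the same inception_v3 builder import as above) is to build the graph once in a throwaway tf.Graph and print every variable:

    import tensorflow as tf

    # Build the model once purely for inspection, then list all variables
    with tf.Graph().as_default():
      images = tf.placeholder(tf.float32, [None, 299, 299, 3])
      inception_v3(images)
      for v in tf.global_variables():
        print(v.name, v.shape.as_list())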