
Trouble Understanding ResNet Implementation

I'm having trouble understanding and replicating the original implementation of ResNet on the CIFAR-10 dataset, as described in the paper "Deep Residual Learning for Image Recognition". Specifically, I have a few questions about the following passage:

We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.

  1. What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?

  2. How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?

  3. How can I implement the training and learning rate scheduling in PyTorch? Since 45000 isn't divisible by 128, should I drop the last 72 images every epoch? Also, since the 32k, 48k, and 64k milestones don't fall on a whole number of epochs, should I round them to the nearest epochs? Or is there a way to change the learning rate and terminate training in the middle of an epoch?

If anyone could point me in the right direction, I would greatly appreciate it. I'm new to deep learning, so thank you for your help and kind understanding.


    1. What does a minibatch size of 128 on two GPUs entail? Does this mean the batch size per GPU is 64?

    When two GPUs run on the same machine, the batch is split between them, as you've said: 64 images per GPU. The gradients produced by the two GPUs are then transferred, averaged, and applied on one of the GPUs, or possibly on the CPU.
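
    To make the averaging step concrete, here is a minimal sketch in plain Python (no GPUs involved). It assumes a toy one-parameter model with squared-error loss; `mean_gradient`, `batch`, and `w` are names made up for this illustration. It only shows that averaging the gradients of two 64-sample halves matches the gradient of the full 128-sample batch. Layers that compute statistics over the batch on each device, such as batch normalization, are outside what it covers.

```python
import random


def mean_gradient(w, samples):
    """Mean gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


random.seed(0)
# A hypothetical minibatch of 128 (x, y) pairs.
batch = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(128)]
w = 0.5

# Gradient computed on the whole minibatch at once.
full = mean_gradient(w, batch)

# Each "GPU" gets half of the minibatch; the two results are then averaged.
per_gpu = [mean_gradient(w, batch[:64]), mean_gradient(w, batch[64:])]
averaged = sum(per_gpu) / len(per_gpu)

print(abs(full - averaged) < 1e-12)
```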

    Here's more info:

    2. How can I convert from iterations to epochs? Is the model trained for 64000 * 128/45000 = 182.04 epochs?

    I encourage everyone to think in terms of iterations rather than epochs. Each iteration equates to a single weight update, which is much more relevant to model convergence than an epoch is. If you think in epochs, you have to adjust the number of training epochs every time you try a different batch size. This isn't the case if you think in terms of iterations (also known as training steps or weight updates). That said, your formula for computing epochs is correct.
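
    The formula from the question can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constants come from the quoted passage, and the helper names `iters_to_epochs` and `iters_per_epoch` are invented here. The conversion assumes every iteration sees a full batch of 128.

```python
BATCH_SIZE = 128       # minibatch size from the paper
TRAIN_SIZE = 45_000    # training part of the 45k/5k split
TOTAL_ITERS = 64_000   # iterations at which training stops


def iters_to_epochs(iters, batch_size=BATCH_SIZE, dataset_size=TRAIN_SIZE):
    """Samples seen after `iters` full batches, in passes over the data."""
    return iters * batch_size / dataset_size


def iters_per_epoch(batch_size=BATCH_SIZE, dataset_size=TRAIN_SIZE, drop_last=False):
    """Weight updates in one pass over the data."""
    full, remainder = divmod(dataset_size, batch_size)
    if drop_last or remainder == 0:
        return full
    return full + 1  # one extra, smaller batch holds the leftover samples


print(round(iters_to_epochs(TOTAL_ITERS), 2))  # 182.04, as in the question
print(TRAIN_SIZE % BATCH_SIZE)                 # 72 leftover images per epoch
print(iters_per_epoch(drop_last=True), iters_per_epoch(drop_last=False))
```

    Because of the 72 leftover images, an epoch is 351 updates if the partial batch is dropped and 352 if it is kept, which is one more reason the iteration count is the less ambiguous number.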

    3. How can I implement the training and learning rate scheduling in PyTorch?

    I think this PyTorch post answers the question; it looks like this was added to PyTorch (sorry for a non-authoritative answer here, I'm more familiar with TensorFlow):

    You can also just use epochs, of course. The learning rate doesn't have to change at exactly the same point the paper describes; getting as near as you reasonably can after rounding will work just fine.
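
    If you do want to change the learning rate and stop in the middle of an epoch, one way is to drive the loop by iterations instead of epochs. The following is a minimal sketch, not the paper's code: `lr_at`, `infinite_batches`, and `train` are hypothetical helpers, and `model`, `loader`, `optimizer`, and `loss_fn` are assumed to be the usual PyTorch module, `DataLoader`, optimizer, and loss. It sets the rate through `optimizer.param_groups`. Calling `torch.optim.lr_scheduler.MultiStepLR` with `milestones=[32000, 48000]` and stepping it once per iteration rather than once per epoch should give the same schedule.

```python
from itertools import islice

TOTAL_ITERS = 64_000
MILESTONES = (32_000, 48_000)
BASE_LR = 0.1


def lr_at(iteration, base_lr=BASE_LR, milestones=MILESTONES, gamma=0.1):
    """Learning rate for a 0-based iteration: scaled by gamma per milestone passed."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** passed


def infinite_batches(loader):
    """Yield batches forever, restarting the loader each time an epoch ends."""
    while True:
        yield from loader


def train(model, loader, optimizer, loss_fn, total_iters=TOTAL_ITERS):
    """Iteration-based training loop (arguments are assumed PyTorch objects)."""
    model.train()
    batches = islice(infinite_batches(loader), total_iters)
    for it, (inputs, targets) in enumerate(batches):
        # Set the rate for this update; milestones may fall mid-epoch.
        for group in optimizer.param_groups:
            group["lr"] = lr_at(it)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

    With a shuffling `DataLoader`, each restart inside `infinite_batches` starts a freshly shuffled pass, so the loop simply stops after exactly 64000 updates wherever that falls in an epoch. Whether to pass `drop_last=True` to the loader (dropping the 72 leftover images each epoch) is a separate choice that the quoted passage doesn't settle.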