Tags: python, pytorch, mnist, batchsize

Batch size and Training time


Thank you to @Prune for the critical comments on my question.

I am trying to find the relationship between batch size and training time using the MNIST dataset.

From reading numerous questions on Stack Overflow, such as this one: How does batch size impact time execution in neural networks?, I got the impression that training time decreases when a smaller batch size is used.

However, after trying both, I found that training with batch size == 1 takes far more time than with batch size == 60,000. I set the number of epochs to 10.

I split my MNIST dataset into 60k images for training and 10k for testing.

Below is my code and the results.

import torch
import torchvision
from torchvision import transforms
from timeit import default_timer as timer  # timer() used in the training loop below

mnist_trainset = torchvision.datasets.MNIST(root=root_dir, train=True, 
                                download=True, 
                                transform=transforms.Compose([transforms.ToTensor()]))

mnist_testset  = torchvision.datasets.MNIST(root=root_dir, 
                                train=False, 
                                download=True, 
                                transform=transforms.Compose([transforms.ToTensor()]))

train_dataloader = torch.utils.data.DataLoader(mnist_trainset, 
                                               batch_size=1, 
                                               shuffle=True)

test_dataloader  = torch.utils.data.DataLoader(mnist_testset, 
                                               batch_size=50, 
                                               shuffle=False)
# Define the model 
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.linear_1 = torch.nn.Linear(784, 256)
        self.linear_2 = torch.nn.Linear(256, 10)
        self.sigmoid  = torch.nn.Sigmoid()

    def forward(self, x):
        x = x.reshape(x.size(0), -1)
        x = self.linear_1(x)
        x = self.sigmoid(x)
        pred = self.linear_2(x)

        return pred
# trainer 
no_epochs = 10

def my_trainer(optimizer, model):

    criterion = torch.nn.CrossEntropyLoss()

    train_loss = list()
    test_loss  = list()
    test_acc   = list()
    best_test_loss = 1

    for epoch in range(no_epochs):
        
        # timer starts 
        start = timer()

        total_train_loss = 0
        total_test_loss = 0

        # training
        # set up training mode 
        model.train()
    
        for itr, (image, label) in enumerate(train_dataloader):

            optimizer.zero_grad()

            pred = model(image)

            loss = criterion(pred, label)
            total_train_loss += loss.item()

            loss.backward()
            optimizer.step()

        total_train_loss = total_train_loss / (itr + 1)
        train_loss.append(total_train_loss)

        # testing 
        # change to evaluation mode 
        model.eval()
        total = 0
        for itr, (image, label) in enumerate(test_dataloader):

            pred = model(image)

            loss = criterion(pred, label)
            total_test_loss += loss.item()

            # we now need softmax because we are testing.
            pred = torch.nn.functional.softmax(pred, dim=1)
            for i, p in enumerate(pred):
                if label[i] == torch.max(p.data, 0)[1]:
                    total = total + 1

        # calculate accuracy 
        accuracy = total / len(mnist_testset)

        # append accuracy here
        test_acc.append(accuracy)

        # append test loss here 
        total_test_loss = total_test_loss / (itr + 1)
        test_loss.append(total_test_loss)

        print('\nEpoch: {}/{}, Train Loss: {:.8f}, Test Loss: {:.8f}, Test Accuracy: {:.8f}'.format(epoch + 1, no_epochs, total_train_loss, total_test_loss, accuracy))

        if total_test_loss < best_test_loss:
            best_test_loss = total_test_loss
            print("Saving the model state dictionary for Epoch: {} with Test loss: {:.8f}".format(epoch + 1, total_test_loss))
            torch.save(model.state_dict(), "model.dth")

        # timer finishes 
        end = timer()
        print(end - start)

    return no_epochs, test_acc, test_loss
model_sgd = Model()
optimizer_SGD = torch.optim.SGD(model_sgd.parameters(), lr=0.1)
sgd_no_epochs,      sgd_test_acc, sgd_test_loss           = my_trainer(optimizer=optimizer_SGD, model=model_sgd)

I measured how long each epoch took.

Here are the results:

Epoch: 1/10, Train Loss: 0.23193890, Test Loss: 0.12670580, Test Accuracy: 0.96230000
63.98903721500005 seconds

Epoch: 2/10, Train Loss: 0.10275097, Test Loss: 0.10111042, Test Accuracy: 0.96730000
63.97179028100004 seconds

Epoch: 3/10, Train Loss: 0.07269370, Test Loss: 0.09668248, Test Accuracy: 0.97150000
63.969843954 seconds

Epoch: 4/10, Train Loss: 0.05658571, Test Loss: 0.09841745, Test Accuracy: 0.97070000
64.24135530400008 seconds

Epoch: 5/10, Train Loss: 0.04183391, Test Loss: 0.09828428, Test Accuracy: 0.97230000
64.19695308500013 seconds

Epoch: 6/10, Train Loss: 0.03393899, Test Loss: 0.08982467, Test Accuracy: 0.97530000
63.96944059600014 seconds

Epoch: 7/10, Train Loss: 0.02808819, Test Loss: 0.08597597, Test Accuracy: 0.97700000
63.59837343000004 seconds

Epoch: 8/10, Train Loss: 0.01859330, Test Loss: 0.07529452, Test Accuracy: 0.97950000
63.591578820999985 seconds

Epoch: 9/10, Train Loss: 0.01383720, Test Loss: 0.08568452, Test Accuracy: 0.97820000
63.66664020100029 seconds

Epoch: 10/10, Train Loss: 0.00911216, Test Loss: 0.07377760, Test Accuracy: 0.98060000
63.92636473799985 seconds

After this, I changed the batch size to 60000 and ran the same program again.

train_dataloader = torch.utils.data.DataLoader(mnist_trainset, 
                                               batch_size=60000, 
                                               shuffle=True)

test_dataloader  = torch.utils.data.DataLoader(mnist_testset, 
                                               batch_size=50, 
                                               shuffle=False)

print("\n===== Entering SGD optimizer =====\n")
model_sgd = Model()
optimizer_SGD = torch.optim.SGD(model_sgd.parameters(), lr=0.1)
sgd_no_epochs,      sgd_test_acc, sgd_test_loss           = my_trainer(optimizer=optimizer_SGD, model=model_sgd)

I got these results for batch size == 60000:

Epoch: 1/10, Train Loss: 2.32325006, Test Loss: 2.30074144, Test Accuracy: 0.11740000
6.54154992299982 seconds

Epoch: 2/10, Train Loss: 2.30010080, Test Loss: 2.29524792, Test Accuracy: 0.11790000
6.341824101999919 seconds

Epoch: 3/10, Train Loss: 2.29514933, Test Loss: 2.29183527, Test Accuracy: 0.11410000
6.161918789000083 seconds

Epoch: 4/10, Train Loss: 2.29196787, Test Loss: 2.28874513, Test Accuracy: 0.11450000
6.180891567999879 seconds

Epoch: 5/10, Train Loss: 2.28899717, Test Loss: 2.28571669, Test Accuracy: 0.11570000
6.1449509030003355 seconds

Epoch: 6/10, Train Loss: 2.28604794, Test Loss: 2.28270152, Test Accuracy: 0.11780000
6.311743144000047 seconds

Epoch: 7/10, Train Loss: 2.28307867, Test Loss: 2.27968731, Test Accuracy: 0.12250000
6.060618773999977 seconds

Epoch: 8/10, Train Loss: 2.28014660, Test Loss: 2.27666961, Test Accuracy: 0.12890000
6.171511712999745 seconds

Epoch: 9/10, Train Loss: 2.27718973, Test Loss: 2.27364607, Test Accuracy: 0.13930000
6.164125173999764 seconds

Epoch: 10/10, Train Loss: 2.27423453, Test Loss: 2.27061504, Test Accuracy: 0.15350000
6.077817454000069 seconds

As you can see, each epoch clearly took more time with batch_size == 1, which is the opposite of what I had read.

Maybe I am confusing the training time per epoch with the training time until convergence? My intuition seems to be correct, judging by this webpage: https://medium.com/deep-learning-experiments/effect-of-batch-size-on-neural-net-training-c5ae8516e57

Can someone please explain what is happening?


Solution

  • This is a borderline question; you should still be able to extract this understanding from the basic literature ... eventually.

    Your insight is exactly correct: you are measuring execution time per epoch, rather than total Time-to-Train (TTT). You have also carried the generic "smaller batches" advice ad absurdum: a batch size of 1 is almost guaranteed to be sub-optimal.
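
    As a minimal sketch of what measuring TTT looks like, assuming train_one_epoch() and evaluate() are wrappers around your existing training and testing loops (the names and the 0.97 threshold are illustrative): keep one timer running across epochs and stop the first time the test accuracy reaches a level you consider acceptable.

    from timeit import default_timer as timer

    def time_to_train(train_one_epoch, evaluate, target_acc=0.97, max_epochs=100):
        # train_one_epoch() and evaluate() are assumed wrappers around your existing
        # loops; evaluate() should return the current test accuracy.
        start = timer()
        for epoch in range(max_epochs):
            train_one_epoch()
            if evaluate() >= target_acc:
                return timer() - start   # total wall-clock time to reach the target
        return float("inf")              # target never reached within max_epochs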

    The mechanics are very simple at a macro level.

    With a batch size of 60k (the entire training set), you run all 60k images through the model, average their results, and then do one back-propagation for that average result. This tends to lose the learning you can get from focusing on little-seen features.

    With a batch size of 1, you run each image individually through the model, average the one result (a very simple operation :-) ), and do a back-propagation. This tends to over-emphasize individual effects, especially retaining superstitious effects from each single image. It also gives too much weight to the initial assumptions of the first few images.

    The most obvious effect of the tiny batch size is that you're doing 60k back-props instead of 1, so each epoch takes much longer.
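
    To put rough numbers on that: the per-epoch cost is largely driven by the number of optimizer steps (forward pass + backward pass + weight update), which is ceil(dataset_size / batch_size).

    import math

    dataset_size = 60_000
    for batch_size in (1, 64, 256, 60_000):
        steps_per_epoch = math.ceil(dataset_size / batch_size)
        print(f"batch_size={batch_size:>6}: {steps_per_epoch:>6} optimizer steps per epoch")
    # batch_size=1 does 60,000 small updates per epoch; batch_size=60,000 does exactly one
    # large update, which is what your ~64 s vs ~6 s per-epoch timings reflect.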


    Either of these approaches is an extreme case, usually absurd in application.

    You need to experiment to find the "sweet spot" that gives you the fastest convergence to acceptable (near-optimal) accuracy. There are a few considerations in choosing your experimental design:

    • Memory size: you want to be able to ingest the entire batch into memory at once. This allows your model to pipeline reading and processing. If you exceed available memory, you will lose a lot of time to swapping. If you under-use the memory, you leave some potential performance untapped.
    • Processors: if you're on a multi-processor chip, you want to keep them all busy. If you care to assign processors through your OS controls, you'll also want to play with how many to assign to model computation, and how many to assign to I/O and system use. For instance, in one project, our group found that our 32 cores were best used with 28 allocated to computation and 4 reserved for I/O and other system functions (the DataLoader sketch after this list shows the corresponding PyTorch knobs).
    • Scaling: some characteristics work best in powers of 2. You may find that a batch size that is 2^n or 3 * 2^n for some n works best, simply because of block sizes and other system allocations.
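
    As a minimal sketch of those knobs in PyTorch (the specific values here are placeholders to experiment with, not recommendations):

    train_dataloader = torch.utils.data.DataLoader(
        mnist_trainset,
        batch_size=256,      # large enough to keep the hardware busy, small enough to fit in memory
        shuffle=True,
        num_workers=4,       # worker processes reserved for data loading / I/O
        pin_memory=True,     # speeds up host-to-GPU copies if you later train on a GPU
    )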

    The experimental design that has worked best for me over the years is to start with a power of 2 that is roughly the square root of the training set size. For you, there's an obvious starting guess of 256. Thus, you'd run experiments at perhaps 64, 128, 256, 512, and 1024. See which ones give you the fastest convergence.

    Then do one step of refinement, using that factor of 3. For instance, if you find that the best performance comes at 128, also try 96 and 192.
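
    A hypothetical sweep putting this together, assuming a helper ttt_for_batch_size(bs) that builds a fresh Model, SGD optimizer, and DataLoader with batch_size=bs, and returns the seconds needed to reach an acceptable test accuracy (for example via the time_to_train sketch above):

    candidate_batch_sizes = [64, 128, 256, 512, 1024]   # powers of 2 around sqrt(60000) ~= 245

    results = {bs: ttt_for_batch_size(bs) for bs in candidate_batch_sizes}
    best = min(results, key=results.get)
    print(f"Fastest convergence at batch_size={best}")

    # One refinement step using the factor of 3: e.g. if 128 wins, also try 96 and 192.
    refinement_candidates = sorted({best * 3 // 4, best, best * 3 // 2})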

    You will likely see very little difference between your "sweet spot" and the adjacent batch sizes; this is the nature of most complex information systems.