python pytorch conv-neural-network loss-function

Training loss is increasing in CNN?


I am training my first CNN to solve a multi-class classification problem. I am feeding in images of animals, each belonging to one of 182 classes, but I have run into some issues. First, my code appears to get stuck on optimizer.step(); it has been computing this for roughly 30 minutes. Second, my training loss is increasing:

EPOCH: 0 BATCH: 1999 LOSS: 1.5790680234357715
EPOCH: 0 BATCH: 3999 LOSS: 2.9340945997834207

If anyone could provide some guidance, that would be greatly appreciated. Below is my code:

#loading data
train_data = dataset.get_subset(
    "train",
    transform=transforms.Compose(
        [transforms.Resize((448, 448)), transforms.ToTensor()]
    ),
)

train_loader = get_train_loader("standard", train_data, batch_size=16)
#defining model
class ConvNet(nn.Module):

  def __init__(self):
    super(ConvNet, self).__init__()
    self.conv1 = nn.Conv2d(3, 6, 3, 1)
    self.conv2 = nn.Conv2d(6, 16, 3, 3)
    self.fc1 = nn.Linear(37*37*16, 120)
    self.fc2 = nn.Linear(120, 84)
    self.fc3 = nn.Linear(84, 182)

  def forward(self, X):
    X = F.relu(self.conv1(X))
    X = F.max_pool2d(X, 2, 2)
    X = F.relu(self.conv2(X))
    X = F.max_pool2d(X, 2, 2)
    X = torch.flatten(X, 1)
    X = F.relu(self.fc1((X)))
    X = F.relu(self.fc2((X)))
    X = self.fc3(X)
    return F.log_softmax(X, dim=1)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(modell.parameters(), lr=0.001)

import time

start_time = time.time()

#VARIABLES  (TRACKER)
epochs = 2
train_losses = []
test_losses = []
train_correct = []
test_correct = []

# FOR LOOP EPOCH
for i in range(epochs):
  trn_corr = 0
  tst_corr = 0

  running_loss = 0.0
  #TRAIN
  for b, (X_train, Y_train, meta) in enumerate(train_loader):
    
    b+=1 #batch starts at 1

    #zero parameter gradients
    optimizer.zero_grad()

    # pass training to model as float (later compute loss)
    output = modell(X_train.float())

    #Calculate the loss of outputs with respect to ground truth values
    loss = criterion(output, Y_train)

    #Backpropagate the loss through the network
    loss.backward()

    #perform parameter update based on the current gradient
    optimizer.step()

    predicted = torch.max(output.data, 1)[1]


    batch_corr = (predicted == Y_train).sum() # True (1) or False (0)
    trn_corr += batch_corr

    running_loss += loss.item()

    if b%2000 == 1999:
      print(f"EPOCH: {i} BATCH: {b} LOSS: {running_loss/2000}")
      running_loss = 0.0

train_losses.append(loss)
train_correct.append(trn_corr)


Solution

  • As for the loss, the model itself may be the cause, as it has room for improvement. Two convolution layers that expand to only 16 channels are probably not sufficient for your data. Use more convolution layers with more channels, for example 5 conv layers with 16, 32, 32, 64, and 64 channels, and experiment with different numbers of layers and channels to see which works best. Also, 3e-4 is a commonly used learning rate for Adam and is worth trying. To track the model’s progress more easily, I’d recommend printing the loss at a shorter interval. About the data: are there enough instances of each class? Is it normalized between 0 and 1?
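
If you change the number of conv layers, the input size of the first Linear layer (37*37*16 in the question) changes too. Below is a minimal sketch in plain Python that traces the spatial size through a stack of conv and pooling layers. The helper names out_size and flattened_features are made up for illustration. The deeper stack's kernel size (3), padding (1), and the 2x2 max pool after every conv are assumptions, since the answer only suggests the number of layers and channels.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output height/width of a Conv2d or MaxPool2d layer on a square input (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(input_size, layers, channels):
    """Trace a square input through (kernel, stride, padding) layers and
    return the length of the flattened output with `channels` feature maps."""
    size = input_size
    for kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
    return size * size * channels


# Network from the question: conv(3, stride 1), pool(2, 2), conv(3, stride 3), pool(2, 2)
question_layers = [(3, 1, 0), (2, 2, 0), (3, 3, 0), (2, 2, 0)]
question_features = flattened_features(448, question_layers, channels=16)

# Hypothetical deeper stack (assumed settings): five blocks of
# conv(3, stride 1, padding 1) followed by pool(2, 2), ending with 64 channels
deeper_layers = [(3, 1, 1), (2, 2, 0)] * 5
deeper_features = flattened_features(448, deeper_layers, channels=64)

print(question_features)  # 21904, i.e. 37 * 37 * 16
print(deeper_features)    # 12544, i.e. 14 * 14 * 64
```

Under these assumptions, the first Linear layer of the deeper network would take 14*14*64 inputs instead of 37*37*16. With a different kernel size, stride, or padding, rerun the helper rather than reusing these numbers.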