Tags: deep-learning, pytorch, cross-entropy

nn.BCELoss() is giving very bad results unlike BCEWithLogitsLoss


I am trying to build a PyTorch classification program on a tabular dataset; my model has the following architecture:

import torch
import torch.nn as nn

BATCH_SIZE = 8
EPOCHS = 10
HIDDEN_NEURONS = 25
LR = 1e-3

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()

        self.input_layer = nn.Linear(X.shape[1], HIDDEN_NEURONS)
        self.linear = nn.Linear(HIDDEN_NEURONS, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.input_layer(x)
        x = self.linear(x)
        x = self.sigmoid(x)

        return x
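
The training loop below uses criterion, optimizer, and device objects that are defined elsewhere; a minimal setup consistent with the first run could look like this (the choice of Adam as the optimizer is an assumption, it is not stated above):

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = MyModel().to(device)
criterion = nn.BCELoss()  # expects probabilities, hence the nn.Sigmoid() in forward()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)  # optimizer choice is assumed, not stated above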

The model is pretty simple and small. It is trained with the following loop:

total_loss_train_plot = []
total_loss_validation_plot = []
total_acc_train_plot = []
total_acc_validation_plot = []

for epoch in range(EPOCHS):
    total_acc_train = 0
    total_loss_train = 0
    total_acc_val = 0
    total_loss_val = 0
    ## Training
    for indx, data in enumerate(train_dataloader):
        input, label = data

        input = input.to(device)
        label = label.to(device)

        prediction = model(input).squeeze(1)
        #print(prediction)
        batch_loss = criterion(prediction, label)

        total_loss_train += batch_loss.item()

        acc = ((prediction).round() == label).sum().item()

        total_acc_train += acc

        batch_loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    ## Validation
    with torch.no_grad():
        for indx, data in enumerate(validation_dataloader):
            input, label = data
            input = input.to(device)
            label = label.to(device)

            prediction = model(input).squeeze(1)

            batch_loss = criterion(prediction, label)

            total_loss_train += batch_loss.item()

            acc = ((prediction).round() == label).sum().item()

            total_acc_val += acc

    total_loss_train_plot.append(round(total_loss_train/1000, 4))
    total_loss_validation_plot.append(round(total_loss_val/1000, 4))
    total_acc_train_plot.append(round(total_acc_train/(training_data.__len__())*100, 4))
    total_acc_validation_plot.append(round(total_acc_val/(validation_data.__len__())*100, 4))

    print(f'''Epoch no. {epoch + 1} Train Loss: {total_loss_train/1000:.4f} Train Accuracy: {(total_acc_train/(training_data.__len__())*100):.4f} Validation Loss: {total_loss_val/1000:.4f} Validation Accuracy: {(total_acc_val/(validation_data.__len__())*100):.4f}''')
    print("="*50)

The loss and accuracy aren't improving; they stay essentially constant:

Epoch no. 1 Train Loss: 105.9250 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 2 Train Loss: 105.9250 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 3 Train Loss: 105.9250 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 4 Train Loss: 105.8375 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 5 Train Loss: 105.8375 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 6 Train Loss: 105.9250 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 7 Train Loss: 105.9250 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 8 Train Loss: 105.9250 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 9 Train Loss: 105.9250 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================
Epoch no. 10 Train Loss: 105.8375 Train Accuracy: 44.9603 Validation Loss: 0.0000 Validation Accuracy: 46.4443
==================================================

But when I changed the loss to BCEWithLogitsLoss and removed the sigmoid layer, training worked fine: the loss decreased and the accuracy increased. With the logits-based loss I get these results (a minimal sketch of that variant follows them):

Epoch no. 1 Train Loss: 0.7597 Train Accuracy: 96.7476 Validation Loss: 0.0000 Validation Accuracy: 98.8270
==================================================
Epoch no. 2 Train Loss: 0.9141 Train Accuracy: 96.2841 Validation Loss: 0.0000 Validation Accuracy: 98.6070
==================================================
Epoch no. 3 Train Loss: 0.6364 Train Accuracy: 97.2189 Validation Loss: 0.0000 Validation Accuracy: 98.1305
==================================================
Epoch no. 4 Train Loss: 0.7539 Train Accuracy: 96.5748 Validation Loss: 0.0000 Validation Accuracy: 98.8270
==================================================
Epoch no. 5 Train Loss: 0.8025 Train Accuracy: 96.6062 Validation Loss: 0.0000 Validation Accuracy: 96.8109
==================================================
Epoch no. 6 Train Loss: 0.6069 Train Accuracy: 96.8340 Validation Loss: 0.0000 Validation Accuracy: 98.9370
==================================================
Epoch no. 7 Train Loss: 0.6626 Train Accuracy: 96.8261 Validation Loss: 0.0000 Validation Accuracy: 96.2977
==================================================
Epoch no. 8 Train Loss: 0.5833 Train Accuracy: 96.6140 Validation Loss: 0.0000 Validation Accuracy: 98.6804
==================================================
Epoch no. 9 Train Loss: 0.4303 Train Accuracy: 97.3604 Validation Loss: 0.0000 Validation Accuracy: 98.2405
==================================================
Epoch no. 10 Train Loss: 0.5376 Train Accuracy: 97.0225 Validation Loss: 0.0000 Validation Accuracy: 96.9208
==================================================
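
For reference, a minimal sketch of the working variant; only the removal of the sigmoid layer and the switch to BCEWithLogitsLoss come from the question, the class name is made up:

class MyModelLogits(nn.Module):
    def __init__(self):
        super().__init__()
        self.input_layer = nn.Linear(X.shape[1], HIDDEN_NEURONS)
        self.linear = nn.Linear(HIDDEN_NEURONS, 1)

    def forward(self, x):
        # return raw logits; BCEWithLogitsLoss applies the sigmoid internally
        x = self.input_layer(x)
        return self.linear(x)

criterion = nn.BCEWithLogitsLoss()

# with raw logits, accuracy should threshold the sigmoid of the output, e.g.
# acc = ((torch.sigmoid(prediction) > 0.5).float() == label).sum().item()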

I know the difference between the two functions: one of them (BCELoss) accepts only probabilities, i.e. values after the sigmoid, and the other works on the logits before the sigmoid. But why does the network behave like this when switching between the two functions? I have used BCELoss with BERT for binary text classification and it worked perfectly fine. Any explanation for this?
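
To make the relationship between the two losses concrete, here is a small self-contained check on toy tensors (not the actual data): BCEWithLogitsLoss applied to logits computes the same value as BCELoss applied to sigmoid(logits), it just applies the sigmoid internally in a numerically more stable way.

import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8)                      # raw model outputs
targets = torch.randint(0, 2, (8,)).float()  # binary labels as floats

loss_probs = nn.BCELoss()(torch.sigmoid(logits), targets)   # sigmoid applied manually
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)       # sigmoid applied internally

print(loss_probs.item(), loss_logits.item())
print(torch.allclose(loss_probs, loss_logits))  # True (up to floating-point error)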


Solution

  • I found the problem: I had to normalize my data. I normalized it using the following code and it worked fine:

    # scale every column to the [-1, 1] range by dividing by its maximum absolute value
    for column in data_df.columns:
        data_df[column] = data_df[column] / data_df[column].abs().max()
    data_df.head()
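
    As a usage note on top of this fix (not part of the original answer): if the data is split into training and validation sets before scaling, the same per-column maxima can be reused for both splits so they are scaled consistently; a minimal sketch assuming two hypothetical DataFrames train_df and val_df:

    # reuse the scaling factors computed on the training split for the validation split
    max_abs = train_df.abs().max()
    train_df = train_df / max_abs
    val_df = val_df / max_abs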