Tags: machine-learning, pytorch, recurrent-neural-network, sentiment-analysis

Constant Training Loss and Validation Loss


I am running an RNN model with the PyTorch library to do sentiment analysis on movie reviews, but somehow the training loss and validation loss remain constant throughout training. I have looked up different online sources but am still stuck.

Can someone please help and take a look at my code?

Some parameters are specified by the assignment:

embedding_dim = 64

n_layers = 1

n_hidden = 128

dropout = 0.5

batch_size = 32

My main code

txt_field = data.Field(tokenize=word_tokenize, lower=True, include_lengths=True, batch_first=True)
label_field = data.Field(sequential=False, use_vocab=False, batch_first=True)

train = data.TabularDataset(path=part2_filepath+"train_Copy.csv", format='csv',
                            fields=[('label', label_field), ('text', txt_field)], skip_header=True)
validation = data.TabularDataset(path=part2_filepath+"validation_Copy.csv", format='csv',
                            fields=[('label', label_field), ('text', txt_field)], skip_header=True)

txt_field.build_vocab(train, min_freq=5)
label_field.build_vocab(train, min_freq=2)

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train, validation, test),
    batch_size=32,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device=device)

n_vocab = len(txt_field.vocab)
embedding_dim = 64
n_hidden = 128
n_layers = 1
dropout = 0.5

model = Text_RNN(n_vocab, embedding_dim, n_hidden, n_layers, dropout)

optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = torch.nn.BCELoss().to(device)

N_EPOCHS = 15
best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    train_loss, train_acc = RNN_train(model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iter, criterion)

My Model

class Text_RNN(nn.Module):
    def __init__(self, n_vocab, embedding_dim, n_hidden, n_layers, dropout):
        super(Text_RNN, self).__init__()
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.emb = nn.Embedding(n_vocab, embedding_dim)
        self.rnn = nn.RNN(
            input_size=embedding_dim,
            hidden_size=n_hidden,
            num_layers=n_layers,
            dropout=dropout,
            batch_first=True
        )
        self.sigmoid = nn.Sigmoid()
        self.linear = nn.Linear(n_hidden, 2)

    def forward(self, sent, sent_len):
        sent_emb = self.emb(sent)
        outputs, hidden = self.rnn(sent_emb)
        prob = self.sigmoid(self.linear(hidden.squeeze(0)))

        return prob

The training function

def RNN_train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        text, text_lengths = batch.text
        predictions = model(text, text_lengths)
        batch.label = batch.label.type(torch.FloatTensor).squeeze()
        predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
        loss = criterion(predictions, batch.label)
        loss.requires_grad = True
        acc = binary_accuracy(predictions, batch.label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The output when running on 10 testing reviews + 5 validation reviews:

Epoch [1/15]:   Train Loss: 15.351 | Train Acc: 44.44%  Val. Loss: 11.052 |  Val. Acc: 60.00%
Epoch [2/15]:   Train Loss: 15.351 | Train Acc: 44.44%  Val. Loss: 11.052 |  Val. Acc: 60.00%
Epoch [3/15]:   Train Loss: 15.351 | Train Acc: 44.44%  Val. Loss: 11.052 |  Val. Acc: 60.00%
Epoch [4/15]:   Train Loss: 15.351 | Train Acc: 44.44%  Val. Loss: 11.052 |  Val. Acc: 60.00%
...

I would appreciate it if someone could point me in the right direction. I believe it is something in the training code, since for the most part I followed this article: https://www.analyticsvidhya.com/blog/2020/01/first-text-classification-in-pytorch/


Solution

  • In your training loop you are using the indices from the max operation, which are not differentiable, so you cannot track gradients through them. Because of that, everything afterwards does not track gradients either, and calling loss.backward() would fail.

    # The indices of the max operation are not differentiable
    predictions = torch.max(predictions.data, 1).indices.type(torch.FloatTensor)
    loss = criterion(predictions, batch.label)
    # Setting requires_grad to True to make .backward() work, although incorrectly.
    loss.requires_grad = True
    

    Presumably you wanted to fix that by setting requires_grad, but that does not do what you expect: no gradients are propagated to your model, because the only thing in your computational graph is the loss itself, and there is nowhere to go from there.
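
    To see why in isolation, here is a minimal standalone sketch (the tensor x and the toy loss are made up for illustration, not taken from the question) showing that nothing upstream of the loss ever receives a gradient:

    import torch

    x = torch.randn(4, 2, requires_grad=True)     # stands in for the model output
    preds = torch.max(x.data, 1).indices.float()  # .data + indices cut the graph here
    print(preds.requires_grad)                    # False - no gradient history

    loss = ((preds - 1.0) ** 2).mean()            # any loss computed on the detached values
    loss.requires_grad = True                     # lets backward() run without an error...
    loss.backward()
    print(x.grad)                                 # None - no gradient ever reaches x (or a model)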

    You used the indices to get either 0 or 1, since the output of your model is essentially two classes and you wanted the one with the higher probability. For the binary cross-entropy loss (BCELoss), you only need one class with a continuous value between 0 and 1, which you get by applying the sigmoid function.
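
    As a quick illustration of what BCELoss expects (again a standalone sketch with made-up shapes, not the question's model): one probability per example, paired with a 0/1 float target.

    import torch

    criterion = torch.nn.BCELoss()
    logits = torch.randn(32, 1, requires_grad=True)  # a single output unit per example
    probs = torch.sigmoid(logits).squeeze(1)         # probabilities in (0, 1), shape [32]
    targets = torch.randint(0, 2, (32,)).float()     # 0/1 labels as floats, shape [32]
    loss = criterion(probs, targets)
    loss.backward()                                  # gradients flow back through the sigmoid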

    So you need to change the output size of the final linear layer to 1:

    self.linear = nn.Linear(n_hidden, 1)
    

    and in your training loop you can remove the torch.max call and also the requires_grad workaround:

    # Squeeze the model's output to get rid of the single class dimension
    predictions = model(text, text_lengths).squeeze()
    batch.label = batch.label.type(torch.FloatTensor).squeeze()
    loss = criterion(predictions, batch.label)
    acc = binary_accuracy(predictions, batch.label)
    optimizer.zero_grad()
    loss.backward()
    

    Since you have only one output at the end, an actual prediction would be either 0 or 1 (nothing in between). To achieve that, you can simply use 0.5 as the threshold, so everything below is considered a 0 and everything above a 1. If you are using the binary_accuracy function of the article you were following, that is done automatically for you, by rounding the predictions with torch.round.
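
    For reference, a binary_accuracy in that spirit looks roughly like this (a sketch; the exact body in the article may differ slightly):

    import torch

    def binary_accuracy(preds, y):
        # Round the sigmoid outputs to the nearest class (0 or 1)
        rounded_preds = torch.round(preds)
        correct = (rounded_preds == y).float()
        return correct.sum() / len(correct)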