
Having trouble with input dimensions for PyTorch LSTM with torchtext


Problem

I'm trying to build a text classifier network using LSTM. The error I'm getting is:

RuntimeError: Expected hidden[0] size (4, 600, 256), got (4, 64, 256)

Details

The data is JSON and looks like this:

{"cat": "music", "desc": "I'm in love with the song's intro!", "sent": "h"}

I'm using torchtext to load the data.

from torchtext import data
from torchtext import datasets

TEXT = data.Field(fix_length = 600)
LABEL = data.Field(fix_length = 10)

BATCH_SIZE = 64

fields = {
    'cat': ('c', LABEL),
    'desc': ('d', TEXT),
    'sent': ('s', LABEL),
}

My LSTM looks like this:

EMBEDDING_DIM = 64
HIDDEN_DIM = 256
N_LAYERS = 4

MyLSTM(
  (embedding): Embedding(11967, 64)
  (lstm): LSTM(64, 256, num_layers=4, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=8, bias=True)
  (sig): Sigmoid()
)
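The question only shows the module printout, not the class itself. A minimal sketch of a `MyLSTM` class matching that printout might look like the following; the forward logic (embedding the tokens, then classifying from the last time step's output) is an assumption, since it is not shown in the question:

```python
import torch
import torch.nn as nn

class MyLSTM(nn.Module):
    def __init__(self, vocab_size=11967, embedding_dim=64, hidden_dim=256,
                 n_layers=4, output_dim=8):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            batch_first=True, dropout=0.5)
        self.dropout = nn.Dropout(p=0.3)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        # With batch_first=True, x is expected as (batch, seq_len)
        embedded = self.embedding(x)               # (batch, seq_len, embedding_dim)
        out, hidden = self.lstm(embedded, hidden)  # (batch, seq_len, hidden_dim)
        out = self.dropout(out[:, -1, :])          # last time step: (batch, hidden_dim)
        return self.sig(self.fc(out)), hidden
```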

I end up with the following dimensions for the inputs and labels:

batch = list(train_iterator)[0]
inputs, labels = batch
print(inputs.shape) # torch.Size([600, 64])
print(labels.shape) # torch.Size([100, 2, 64])

And my initialized hidden state (a tuple of hidden and cell state tensors) looks like:

hidden # [torch.Size([4, 64, 256]), torch.Size([4, 64, 256])]
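A hidden state with those dimensions is typically initialized with zeros; a minimal sketch (assuming a unidirectional LSTM, so `num_directions = 1`):

```python
import torch

N_LAYERS, BATCH_SIZE, HIDDEN_DIM = 4, 64, 256

def init_hidden(batch_size):
    # h_0 and c_0 both have shape (num_layers * num_directions, batch, hidden_size);
    # the batch is always dimension 1 here, regardless of batch_first.
    h0 = torch.zeros(N_LAYERS, batch_size, HIDDEN_DIM)
    c0 = torch.zeros(N_LAYERS, batch_size, HIDDEN_DIM)
    return h0, c0
```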

Question

I'm trying to understand what the dimensions at each step should be. Should the hidden dimension be initialized to (4, 600, 256) or (4, 64, 256)?


Solution

  • The documentation of nn.LSTM - Inputs explains what the dimensions are:

    • h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. If the LSTM is bidirectional, num_directions should be 2, else it should be 1.

    Therefore, your hidden state should have size (4, 64, 256), so you did that correctly. On the other hand, you are not providing the correct size for the input.

    • input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.

    While it says the size of the input needs to be (seq_len, batch, input_size), you've set batch_first=True in your LSTM, which swaps batch and seq_len, so your input would need size (batch, seq_len, input_size). That is not the case: your input has seq_len first (600) and batch second (64). This is the default layout in torchtext, because seq-first is the more common representation and matches the default behaviour of nn.LSTM.

    You need to set batch_first=False in your LSTM.

    Alternatively, if you prefer having batch as the first dimension in general, torchtext's data.Field also has a batch_first option.
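To illustrate the fix, here is a toy-sized sketch (dimensions scaled down from the question's 600/64/256 for brevity) showing that with the default batch_first=False the input is seq-first, while the hidden state still has batch in dimension 1:

```python
import torch
import torch.nn as nn

seq_len, batch, emb_dim, hidden_dim, n_layers = 10, 3, 8, 16, 2

# Default batch_first=False: input is (seq_len, batch, input_size),
# which matches what torchtext's iterators yield.
lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=n_layers)

x = torch.randn(seq_len, batch, emb_dim)
# Hidden state batch dimension is dim 1 in both layouts.
h0 = torch.zeros(n_layers, batch, hidden_dim)
c0 = torch.zeros(n_layers, batch, hidden_dim)

out, (hn, cn) = lstm(x, (h0, c0))
print(out.shape)  # torch.Size([10, 3, 16]) -- (seq_len, batch, hidden_dim)
```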