python, machine-learning, deep-learning, pytorch, lstm

Seq2Seq LSTM not learning properly


I am trying to solve a sequence-to-sequence problem with an LSTM in PyTorch. Concretely, I take sequences of 5 elements to predict the next 5. My concern is with the data transformations. I have tensors of size [bs, seq_length, features], where seq_length = 10 and features = 1. Each feature is an integer between 0 and 3 (both included).

I believed the input data had to be scaled to the float range [0, 1] with a MinMaxScaler, to make the LSTM's learning easier. After the LSTM, I apply a Linear layer, which transforms the hidden states into the corresponding output of size features. My definition of the LSTM network in PyTorch:

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers, dropout_prob):
        super(LSTM, self).__init__()
        self.lstm_layer = nn.LSTM(input_dim, hidden_dim, num_layers, dropout=dropout_prob)
        self.output_layer = nn.Linear(hidden_dim, output_dim)

    ...

    def forward(self, X):
        out, (hidden, cell) = self.lstm_layer(X)
        out = self.output_layer(out)
        return out
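
For completeness, the transform used in the training loop below is the [0, 1] scaling step described above; simplified, it does something like this (the real code uses a MinMaxScaler, so the details may differ):

import torch

# Simplified sketch of the [0, 1] scaling; the real code uses a MinMaxScaler.
# Features are integers in [0, 3], so min-max scaling reduces to dividing by 3.
def transform(x):
    x_min, x_max = 0.0, 3.0  # known feature range
    return (x - x_min) / (x_max - x_min)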

The code I use for the training loop is the following:

def train_loop(t, checkpoint_epoch, dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, X in enumerate(dataloader):
        X = X[0].type(torch.float).to(device)

        # X = torch.Size([batch_size, 10, input_dim])
        # Split sequences into input and target
        inputs = transform(X[:, :5, :]) # inputs = [batch_size, 5, input_dim]
        targets = transform(X[:, 5:, :]) # targets = [batch_size, 5, input_dim]

        # predictions (forward pass)
        with autocast():
            pred = model(inputs)  # pred = [batch_size, 5, input_dim]
            loss = loss_fn(pred, targets)

        # backprop
        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            #print(f"Current loss: {loss:>7f}, [{current:>5d}/{size:>5d}]")

        # Delete variables and empty cache
        del X, inputs, targets, pred
        torch.cuda.empty_cache()

    return loss
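
The loop above relies on a mixed-precision setup (autocast, scaler) and a device defined outside train_loop, roughly:

import torch
from torch.cuda.amp import autocast, GradScaler

# Mixed-precision setup assumed by train_loop (defined at module level).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
scaler = GradScaler()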

The code I used for preprocessing the data:

def main():
    num_agents = 2
    # Open the HDF5 file
    with h5py.File('dataset_' + str(num_agents) + 'UAV.hdf5', 'r') as f:
        # Access the dataset
        data = f['data'][:]
        # Convert to PyTorch tensor
        data_tensor = torch.tensor(data)

        size = data_tensor.size()
        seq_length = 10
        reshaped = data_tensor.view(-1, size[2], size[3])

        r_size = reshaped.size()
        reshaped = reshaped[:, :, 1:]
        reshaped_v2 = reshaped.view(r_size[0], -1)

        dataset = create_dataset(reshaped_v2.numpy(), seq_length)

        f.close()

    dataset = TensorDataset(dataset)

    # Split the dataset into training and validation sets
    train_size = int(0.8 * len(dataset))  # 80% for training
    val_size = len(dataset) - train_size  # 20% for validation
    train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

    train_dataloader = DataLoader(train_dataset, batch_size=params['batch_size'], shuffle=True, pin_memory=True)
    val_dataloader = DataLoader(val_dataset, batch_size=params['batch_size'], shuffle=False, pin_memory=True)
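
create_dataset builds the fixed-length windows of seq_length steps that the training loop expects; simplified, it does something like this:

import numpy as np
import torch

# Simplified sketch of create_dataset: slide a window of seq_length steps over the
# time axis and stack the windows into a [num_sequences, seq_length, features] tensor.
def create_dataset(data, seq_length):
    windows = [data[i:i + seq_length] for i in range(len(data) - seq_length + 1)]
    return torch.tensor(np.stack(windows), dtype=torch.float)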

Trying this, the model was not learning properly, so I thought that directly calculating the loss between targets (float values in the range [0, 1]) and pred (what I believe are float values in the range [-1, 1], because of the tanh activation functions in the LSTM layer), which are on different scales, might be wrong. Then I tried applying a sigmoid activation function right after the Linear layer in the forward pass, but it wasn't learning properly either. I ran executions with many hyperparameter combinations, but none resulted in a "normal" training curve. I also attach a screenshot of 5000 epochs of training to illustrate the process:

[screenshot: training process]

My questions are:

  • What seems to be wrong in my training process?
  • Is there anything in my reasoning above that is wrong?

Solution

  • The big problem with your code is how you define your LSTM layer.

    nn.LSTM by default expects an input of shape (sl, bs, features), while your input is of shape (bs, sl, features). As a result, your current code is processing along the wrong dimension. You need to pass batch_first=True to nn.LSTM to use a batch-first input (see the nn.LSTM docs).
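
    Concretely, only the layer construction changes (the rest of your model can stay as it is):

    # Fixed layer definition: batch_first=True makes nn.LSTM accept (bs, sl, features)
    self.lstm_layer = nn.LSTM(input_dim, hidden_dim, num_layers,
                              dropout=dropout_prob, batch_first=True)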

    Additionally, your data setup is flawed. Your LSTM processes sequences one item at a time, meaning sequence element i has only seen elements 0, 1, ..., i-1. But you expect this element to predict element i+5 of the overall sequence.

    As a more concrete example: the 2nd item in your input is expected to predict the 2nd item in the output (the 7th item overall) without seeing any of the intermediate sequence items. You're trying to predict elements of the output sequence from only a fraction of the input sequence.

    The best approach would be to use next-step prediction instead. That way each step has seen all the steps before it, and there is no information gap to the step being predicted. It's a simple change:

    # old
    inputs = transform(X[:, :5, :])
    targets = transform(X[:, 5:, :])
    
    # new
    inputs = transform(X[:, :-1, :]) # all but the last step
    targets = transform(X[:, 1:, :]) # all but the first step
    

    If you really want to stick to using the first 5 steps to predict the next 5, you need a sequence-to-sequence model. This involves adding a decoder LSTM that uses the hidden state produced by the encoder LSTM (and thus has information from all input time steps). Seq2seq also requires a for loop for the decoder, which is super annoying.
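
    If you do go that route, here is a rough, hypothetical sketch of the encoder-decoder version (untested outline; it assumes output_dim == input_dim so predictions can be fed back into the decoder, which holds here since features = 1):

    import torch
    from torch import nn

    class Seq2SeqLSTM(nn.Module):
        # Hypothetical sketch: encode the first 5 steps, then decode the next 5 one at a time.
        def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
            super().__init__()
            self.encoder = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
            self.decoder = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True)
            self.output_layer = nn.Linear(hidden_dim, output_dim)

        def forward(self, src, target_len=5):
            _, (hidden, cell) = self.encoder(src)  # hidden/cell summarize all input steps
            dec_input = src[:, -1:, :]             # start decoding from the last input step
            outputs = []
            for _ in range(target_len):            # the decoder for loop mentioned above
                out, (hidden, cell) = self.decoder(dec_input, (hidden, cell))
                step = self.output_layer(out)      # [bs, 1, output_dim]
                outputs.append(step)
                dec_input = step                   # feed the prediction back in (no teacher forcing)
            return torch.cat(outputs, dim=1)       # [bs, target_len, output_dim]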