python, pytorch, lstm

Converting lists of uneven size into LSTM input tensor


So I have a nested list of 1366 samples, each with 2 features per time step and varying sequence lengths, that is supposed to be the input data for an LSTM. The labels are supposed to be a pair of values for each sequence, e.g. [-0.76797587, 0.0713816]. In essence the data looks like the following:

X = [[[-0.11675862, -0.5416186], [-0.76797587, 0.0713816]], [[-0.5115555, 0.25823522], [0.6099151999999999, 0.21718016], [-0.0022403747, 0.6470206999999999]]]

What I would like to do is convert this list into an input tensor. As I understand, LSTMs accept sequences of different lengths, so in this case the first sample has length 2 and the second has length 3.

Currently I'm trying to convert the list in the following way:

train_data = TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(Y, dtype=torch.float32))
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)

This, however, produces the following error: ValueError: expected sequence of length 5 at dim 1 (got 3)

I'm guessing this is because the first sequence has length 5 and the second has length 3, so they can't be stacked into a single tensor?
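
For reference, a minimal snippet with one length-5 and one length-3 sequence seems to reproduce the same error:

import torch

# two sequences of different lengths (5 and 3), each step with 2 features
ragged = [[[0.1, 0.2]] * 5, [[0.3, 0.4]] * 3]
torch.tensor(ragged)  # ValueError: expected sequence of length 5 at dim 1 (got 3)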

How do I convert the given list into a tensor? Or am I thinking wrong about the way to train the LSTM?

Thanks for any help!


Solution

  • So as you said, the sequence length can be different. But because we work with batches, within each batch all sequences have to be the same length anyway, since all samples in a batch are processed simultaneously. So what you have to do is pad the samples to the same size: take the length of the longest sequence in the batch and fill all other samples with zeros until they have that length. For that you can use PyTorch's pad_sequence function, like this:

    from torch.nn.utils.rnn import pad_sequence
    
    # the batch must be a python list containing the tensor samples,
    # each of shape (seq_len, num_features)
    sample_batch = [torch.randn(4, 2), torch.randn(2, 2), torch.randn(5, 2)]
    
    # pad all samples in the batch to the length of the longest sample
    padded_batch = pad_sequence(sample_batch, batch_first=True)
    
    # padded_batch now has shape (BATCH_SIZE, MAX_SEQUENCE_LENGTH, INPUT_SIZE),
    # i.e. (3, 5, 2) here
    print(padded_batch.size())
    

    Now all samples in the batch have the shape (5, 2), because the longest sample had a sequence length of 5.
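
    If you want to sanity-check that such a padded batch is a valid LSTM input, you can feed it straight into an nn.LSTM (a minimal sketch; hidden_size=16 is an arbitrary value chosen just for illustration):

    import torch
    from torch import nn
    from torch.nn.utils.rnn import pad_sequence
    
    sample_batch = [torch.randn(4, 2), torch.randn(2, 2), torch.randn(5, 2)]
    padded_batch = pad_sequence(sample_batch, batch_first=True)  # (3, 5, 2)
    
    # input_size must match the number of features per time step (2 here)
    lstm = nn.LSTM(input_size=2, hidden_size=16, batch_first=True)
    output, (h_n, c_n) = lstm(padded_batch)
    
    print(output.shape)  # torch.Size([3, 5, 16]) -- one output per (padded) time step
    print(h_n.shape)     # torch.Size([1, 3, 16]) -- final hidden state for each sample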

    If you don't know how to implement this with the PyTorch DataLoader, you can create a custom collate_fn:

    def custom_collate(batch):
        # batch is a list of (sample, target) pairs coming from the Dataset
        sample_batch, target_batch = [], []
        for sample, target in batch:
            sample_batch.append(sample)
            target_batch.append(target)
    
        # pad to the longest sequence in this batch:
        # shape (BATCH_SIZE, MAX_SEQUENCE_LENGTH, INPUT_SIZE)
        padded_batch = pad_sequence(sample_batch, batch_first=True)
    
        # stack the per-sample target pairs into a (BATCH_SIZE, 2) tensor
        return padded_batch, torch.cat(target_batch, dim=0).reshape(len(sample_batch), -1)
    

    Now you can tell the DataLoader to apply this function to each batch before returning it:

    train_dataloader = DataLoader(
            train_data,
            batch_size=batch_size,
            num_workers=1,
            shuffle=True,
            collate_fn=custom_collate    # <-- NOTE THIS
        )
    

    Now the DataLoader returns padded batches!
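
    For completeness, here is how the pieces could fit together with the example data from the question, reusing the custom_collate from above (a minimal sketch; the SequenceDataset class and the second target in Y are made up purely for illustration):

    import torch
    from torch.utils.data import Dataset, DataLoader
    
    class SequenceDataset(Dataset):
        """Wraps the nested list X (variable-length sequences) and Y (one target pair per sequence)."""
        def __init__(self, X, Y):
            self.X = [torch.tensor(x, dtype=torch.float32) for x in X]
            self.Y = [torch.tensor(y, dtype=torch.float32) for y in Y]
    
        def __len__(self):
            return len(self.X)
    
        def __getitem__(self, idx):
            return self.X[idx], self.Y[idx]
    
    X = [[[-0.11675862, -0.5416186], [-0.76797587, 0.0713816]],
         [[-0.5115555, 0.25823522], [0.6099151999999999, 0.21718016], [-0.0022403747, 0.6470206999999999]]]
    Y = [[-0.76797587, 0.0713816], [0.1, -0.2]]   # made-up targets, just for the example
    
    train_dataloader = DataLoader(SequenceDataset(X, Y), batch_size=2,
                                  shuffle=True, collate_fn=custom_collate)
    
    for inputs, targets in train_dataloader:
        print(inputs.shape)   # torch.Size([2, 3, 2]) -- padded to the longest sequence in the batch
        print(targets.shape)  # torch.Size([2, 2])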