Tags: python, pytorch, pytorch-dataloader

How do I turn a numpy ndarray into a PyTorch dataset?


I have a numpy ndarray of shape (16699, 128, 128), where each element is a 128×128 pixel image normalized to the range 0 to 1. To feed an image into a neural network model, I currently have to take each element of the array, convert it to a tensor, and add an extra channel dimension with .unsqueeze(0) to bring it to the (C, H, W) format. I'd like to simplify all of this with PyTorch's Dataset and DataLoader utilities so I can use batching and so on. How can I do it?

This is the method I have now:

epochs = 3

for epoch in range(epochs):
    for i in range(len(X)):  # iterate over samples one at a time
        target = torch.from_numpy(np.asarray(y[i]))     # avoid rebinding y itself
        x = torch.from_numpy(X[i]).unsqueeze(0)         # add channel dim -> (1, H, W)
        ...

Solution

  • One way is to convert X and y to two tensors (both with the same length), then wrap them in a torch.utils.data.TensorDataset.

    from torch.utils.data import TensorDataset, DataLoader
    
    batch_size = 128
    dataset = TensorDataset(torch.from_numpy(X).unsqueeze(1), torch.from_numpy(y))
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size)
    
    ...
    
    # training loop
    for epoch in range(epochs):
        for x, y in loader:
            # x is a tensor batch of images with shape (batch_size, 1, H, W)
            # y is a tensor with the corresponding labels
            ...