python pandas pytorch normalization

PyTorch - How to normalize/transform data manually for a DataLoader


I am following a LinkedIn Learning tutorial on neural networks, but applying its techniques to a different dataset of my own. I am struggling to normalize/transform my data the way the tutorial does, because it relies on built-in functionality that I do not know how to reproduce.

Here is an example of what they are doing:

import torch
from torchvision import datasets, transforms

mean, std = (0.5,), (0.5,)

# Create a transform and normalise data
transform = transforms.Compose([transforms.ToTensor(),
                            transforms.Normalize(mean, std)
                          ])

# Download FMNIST training dataset and load training data
trainset = datasets.FashionMNIST('~/.pytorch/FMNIST/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

They create a transform and pass it straight into FashionMNIST, which seems to apply the transform to the trainset automatically.

I want to do the same thing for my dataset, but there is no built-in equivalent of FashionMNIST for it. How would I replicate this?

Here's what I'm doing/know how to do:

import pandas as pd
import torch

df = pd.read_csv('../input/sign-language-mnist/sign_mnist_train.csv')
trainloader = torch.utils.data.DataLoader(df, batch_size = 64, shuffle = True)

How would I apply the same transform to my df without the help of the built-in FashionMNIST method?


Solution

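  • You need to build a custom PyTorch Dataset and pass that to your DataLoader. Indexing a DataFrame with an integer looks up a column, not a row, so the DataLoader cannot draw samples from the DataFrame directly.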

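    import torch
    from torch.utils.data import Dataset

    class MNistDataset(Dataset):
        def __init__(self, df):
            # Assumption: the CSV has a 'label' column and every other
            # column holds one pixel value in the range 0-255.
            self.labels = torch.tensor(df['label'].values, dtype=torch.long)
            pixels = torch.tensor(df.drop(columns='label').values, dtype=torch.float32)
            self.images = self.custom_norm_function(pixels)

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            # Each image is a flat vector; reshape here if your model expects 2D input.
            return self.images[idx], self.labels[idx]

        def custom_norm_function(self, x, mean=0.5, std=0.5):
            # Same arithmetic as ToTensor() followed by Normalize(mean, std):
            # scale to [0, 1], then subtract the mean and divide by the std.
            return (x / 255.0 - mean) / std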
    

    Define "custom_norm_function" however your data requires, then pass the dataset to your DataLoader.

    from torch.utils.data import DataLoader

    dataset = MNistDataset(df)
    trainloader = DataLoader(dataset, batch_size=64, shuffle=True)
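
As a rough illustration of what a "custom_norm_function" could compute, here is a minimal sketch of the arithmetic behind `transforms.ToTensor()` followed by `transforms.Normalize((0.5,), (0.5,))`. The function name `normalize_pixels` is hypothetical, and the sketch assumes 8-bit pixel values (0 to 255), which is what `ToTensor()` rescales to the range 0 to 1.

```python
def normalize_pixels(x, mean=0.5, std=0.5):
    """Scale 8-bit pixel values to [0, 1], then apply (x - mean) / std.

    Hypothetical helper: it only uses arithmetic operators, so it works
    elementwise on plain numbers, NumPy arrays, or torch tensors.
    """
    return (x / 255.0 - mean) / std


# With mean = std = 0.5 the pixel range 0..255 maps onto -1..1.
print(normalize_pixels(0.0))    # -1.0
print(normalize_pixels(255.0))  # 1.0
print(normalize_pixels(127.5))  # 0.0
```

If you pass a DataFrame or array that still contains the label column, the labels would be rescaled too, so split the labels off first.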
    

    You can read more in the PyTorch data loading tutorial: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
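
The torchvision datasets take a `transform` argument and apply it to each sample inside `__getitem__`. Below is a sketch of the same pattern for a DataFrame. It is not the original answer's code: the class name `DataFrameImageDataset`, the helper `scale_and_normalize`, and the column layout (a `label` column plus one column per pixel of a square image) are all assumptions, and a small random DataFrame stands in for the real CSV.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset


class DataFrameImageDataset(Dataset):
    """Hypothetical dataset: a 'label' column plus one column per pixel (0-255)."""

    def __init__(self, df, transform=None):
        self.labels = df["label"].to_numpy(dtype=np.int64)
        self.pixels = df.drop(columns="label").to_numpy(dtype=np.float32)
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Reshape the flat row to (channels, height, width); assumes square images.
        side = int(round(self.pixels.shape[1] ** 0.5))
        image = torch.from_numpy(self.pixels[idx]).reshape(1, side, side)
        # Apply the transform per sample, as the built-in datasets do.
        if self.transform is not None:
            image = self.transform(image)
        return image, int(self.labels[idx])


def scale_and_normalize(image, mean=0.5, std=0.5):
    # Scale 8-bit pixels to [0, 1], then apply (x - mean) / std.
    return (image / 255.0 - mean) / std


# Synthetic stand-in for the CSV: 10 random 28x28 images with labels 0-4.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.integers(0, 256, size=(10, 784)),
    columns=[f"pixel{i}" for i in range(1, 785)],
)
demo_df.insert(0, "label", rng.integers(0, 5, size=10))

dataset = DataFrameImageDataset(demo_df, transform=scale_and_normalize)
loader = DataLoader(dataset, batch_size=4, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)  # torch.Size([4, 1, 28, 28]) torch.Size([4])
```

Normalizing once in `__init__` is simpler and faster for a fixed transform. Applying the transform per sample costs a little on every access but lets you swap transforms, or add random augmentation later, without rebuilding the dataset.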