
Python Dataset Class + PyTorch Dataloader: Stuck at __getitem__, how to get Index, Label and so on during Testing?


I have a problem that is probably small, but I have been stuck on it for quite a while now, and I hope someone can help. I am currently working with the Kddcup99 dataset, on which I would like to train a deep learning model (a CNN).

I have a "Dataset" class that holds the pandas DataFrame, which I split into a normal and a validation dataset. So far, no problem. I load the data into a NumPy array, convert it to a tensor, and then pass it to the DataLoader.

The Dataset class has these two important methods for iterating through it:

def __len__(self):
    return len(self.val_df)

def __getitem__(self, index):
    img, target = self.val_df[index][:-1], self.val_df[index][-1]
    return img, target, index

The DataLoader call sits outside the class:

test_dataloader = DataLoader(datat.val_df, batch_size=10, shuffle=True)

In my Trainer class I have a for loop that should iterate through the DataLoader:

with torch.no_grad():
    for data in dataloader:
        inputs, labels, idx = data
        inputs = inputs.to(self.device)

But it doesn't work: I can't access the labels, the index, and so on.

My question is: why? How can I access the labels and the index from the given Dataset via the DataLoader?

Thank you all for your help! Much appreciated.


Solution

  • The first argument to DataLoader is the dataset from which you want to load the data. That is usually a Dataset, but it does not have to be an instance of Dataset: any object that defines a length (__len__) and can be indexed (__getitem__) is acceptable.

    You are passing datat.val_df to the DataLoader, which is presumably a NumPy array. A NumPy array has a length and can be indexed, so it can be used in the DataLoader. Since you pass that array directly, your dataset's __getitem__ is never called; the array itself is indexed instead, so every item is just datat.val_df[index]. Each batch is therefore a single tensor of whole rows rather than separate inputs, labels, and indices.

    Instead of using the underlying data for the DataLoader, you have to use the dataset itself (datat):

    test_dataloader = DataLoader(datat, batch_size=10, shuffle=True)
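
To make the difference concrete, here is a minimal sketch that contrasts the two calls on toy data. The class name KddDataset, the random data, and the layout (four feature columns followed by one label column) are assumptions made for illustration only; the question's real class and data will differ. shuffle is left off here so the returned indices are predictable.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class KddDataset(Dataset):
    """Hypothetical stand-in for the question's dataset class."""

    def __init__(self, val_df):
        # Assumed layout: 2-D float array whose last column is the label.
        self.val_df = val_df

    def __len__(self):
        return len(self.val_df)

    def __getitem__(self, index):
        row = self.val_df[index]
        return row[:-1], row[-1], index


# Toy data: 20 rows, 4 feature columns plus 1 label column.
rng = np.random.default_rng(0)
features = rng.random((20, 4), dtype=np.float32)
labels = rng.integers(0, 2, size=(20, 1)).astype(np.float32)
datat = KddDataset(np.hstack([features, labels]))

# Passing the raw array: the array is indexed directly, so each batch
# is one tensor of whole rows (features and label still glued together).
raw_batch = next(iter(DataLoader(datat.val_df, batch_size=10)))

# Passing the dataset: __getitem__ runs for every sample, so each batch
# unpacks into three collated fields.
inputs, targets, idx = next(iter(DataLoader(datat, batch_size=10)))

print(raw_batch.shape)               # one tensor per batch
print(inputs.shape, targets.shape)   # features and labels separated
print(idx)                           # sample indices of this batch
```

With the dataset passed in, the loop from the question (inputs, labels, idx = data) unpacks as intended, and idx tells you which rows of val_df each batch came from even when shuffle=True.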